BITS Meetings' Virtual Library:
Abstracts from Italian Bioinformatics Meetings from 1999 to 2013


766 abstracts overall from 11 distinct proceedings





1. Guigò R
Finding genes by comparing genomes: the case of selenoproteins
Meeting: BITS 2004 - Year: 2004
Topic: Comparative genomics

Abstract: Although the genome sequence and gene content are available for an increasing number of organisms, eukaryotic selenoproteins remain poorly characterized. In these proteins, selenium (Se) is incorporated in the form of selenocysteine (Sec), the 21st amino acid. Selenocysteine is cotranslationally inserted in response to UGA codons (a stop signal in the canonical genetic code). The alternative decoding is mediated by a stem-loop structure in the 3'UTR of selenoprotein mRNAs (the SECIS element). Selenium is implicated in male infertility, cancer and heart diseases, viral expression and ageing. In addition, most selenoproteins have homologues in which Sec is replaced by cysteine (Cys). Genome biologists rely on the high-quality annotation of genomes to bridge the gap from the sequence to the biology of the organism. However, for selenoproteins, which mediate the biological functions of selenium, the dual role of the UGA codon confounds both the automatic annotation pipelines and the human curators. As a consequence, selenoproteins are misannotated in the majority of genome projects. Furthermore, finding novel selenoprotein families remains a difficult task in the newly released genome sequences. In the last few years, we have contributed to the exhaustive description of the eukaryotic selenoproteome (the set of eukaryotic selenoproteins) through the development of a number of ad hoc computational tools. Our approach is based on the capacity to predict SECIS elements, standard genes and genes with an in-frame UGA codon in one or multiple genomes. Indeed, comparative analysis plays an essential role because 1) SECIS sequences are conserved between close species (e.g. human-mouse); and 2) sequence conservation across a UGA codon between genomes at greater phylogenetic distance strongly suggests a coding function (e.g. human-fugu). Our analysis of the fly, human and fugu genomes has resulted in 8 novel selenoprotein families.
Therefore, 19 distinct selenoprotein families have been described in eukaryotes to date. Most of these families are widely (but not uniformly) distributed across eukaryotes, either as true selenoproteins or Cys-homologues. The recent completion of the Tetraodon nigroviridis and Fugu rubripes genomes has allowed us to investigate the eukaryotic selenoproteome in a restricted and largely unexplored window within the vertebrate phylogeny. Our investigation has resulted in the identification of a novel selenoprotein family, currently under study, which appears to be restricted to actinopterygians among vertebrates. The correct annotation of selenoproteins is thus providing insight into the evolution of the usage of Sec. Our data indicate a discrete evolutionary distribution of selenoproteins in eukaryotes and suggest that, contrary to the prevalent thinking of an increase in the number of selenoproteins from less to more complex genomes, Sec-containing proteins scatter all along the complexity scale. We believe that the particular distribution of each family is mediated by an ongoing process of Sec/Cys interconversion, in which contingent events could play a role as important as functional constraints. The characterization of eukaryotic selenoproteins illustrates some of the most important challenges involved in the completion of the gene annotation of genomes. Notable among them are the increasing number of exceptions to our standard theory of the eukaryotic gene and the need to sequence genomes at different evolutionary distances to reach such a complete annotation.
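
As a rough illustration of why the dual role of UGA confounds annotation, the sketch below lists in-frame UGA (TGA in DNA) codons that occur before the final codon of a coding sequence. It is a minimal example, not the authors' tools: the function name and the toy sequences are hypothetical, and SECIS prediction and cross-species conservation are not modelled.

```python
def inframe_tga_positions(cds):
    """Return 0-based nucleotide offsets of in-frame TGA codons.

    The last codon is skipped, since a TGA there is an ordinary stop.
    Hypothetical helper for illustration only.
    """
    cds = cds.upper().replace("U", "T")   # accept RNA or DNA input
    usable = len(cds) - len(cds) % 3      # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [3 * i for i, codon in enumerate(codons[:-1]) if codon == "TGA"]


# Toy sequence: ATG TGA GGC TAA has one internal in-frame TGA at offset 3.
print(inframe_tga_positions("ATGTGAGGCTAA"))
```

A conventional gene finder would truncate such a sequence at the internal codon, which is the kind of misannotation the abstract describes.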

2. Harris MA
Ontologies for Biology: The Gene Ontology Project
Meeting: BITS 2004 - Year: 2004
Topic: Database annotation and data mining

Abstract: The Gene Ontology (GO) project is a collaborative effort to construct and apply controlled vocabularies, or ontologies, to facilitate the biologically meaningful annotation of genes and their products in a wide variety of databases. Participating groups include the major model organism databases and other database groups such as the UniProt Consortium (Swiss-Prot + TrEMBL + PIR), the Genome Knowledgebase project, The Institute for Genomic Research (TIGR), and others. The GO project maintains three vocabularies describing different aspects of molecular and cell biology: Molecular function describes activities, such as catalytic or binding activities, at the molecular level. Biological process describes broad objectives, each accomplished by one or more ordered assemblies of molecular functions. Cellular component describes locations where a gene product may act, and includes both subcellular structures and macromolecular complexes. The GO vocabularies were originally developed for the description of gene products in databases, and many annotation data sets are made available to the public by GO Consortium members. The GO vocabularies and annotations are part of a community resource that also includes software tools for working with the ontologies and annotations, project documentation, and links to relevant literature. The GO project has also provided a model for the development of ontologies for additional aspects of biology. Chief among the more recently developed vocabularies is the Sequence Ontology (SO), which provides a structured controlled vocabulary for sequence annotation, for the exchange of annotation data and for the description of sequence objects in databases. The SO and other emerging shared, structured vocabularies are publicly available from the Open Biology Ontologies web site (http://obo.sourceforge.net/).
Ontologies must meet five criteria for inclusion in OBO: openness, sharable syntax (such as the GO syntax or OWL), orthogonality to other OBO ontologies, shared ID space, and term definitions.

3. Kleywegt GJ
Structural Bioinformatics - Understanding Protein Structure and Function
Meeting: BITS 2004 - Year: 2004
Topic: Structural bioinformatics

Abstract: Structural Bioinformatics deals with the interface between sequence and structure (e.g., structure prediction, structure-based sequence alignment), that between structure and function (e.g., predicting function on the basis of observed structural similarities), and with the analysis of structural information per se (e.g., fold comparison and database construction). In this talk I will discuss four of our recent and on-going Structural Bioinformatics projects.
- We have assessed and compared the performance of eleven web-based servers for fold-comparison that can be used to find out if a newly determined protein structure displays any similarity to known structures.
- We have previously described SPASM and the SPASM server that can be used to answer the question: "does this structural motif (e.g., active site, ligand-binding site, or a strange loop) occur in any other protein structures?". One of the earliest tests of the method was to answer the question: "do left-handed helices occur in natural protein structures?". We found a very significant hit and have therefore undertaken a more detailed investigation. The preliminary results of this study will be presented.
- In order to make working with sequences easier for structural biologists (and, hopefully, to make working with structures less daunting to people from the "sequence world"), we have developed a workbench called Indonesia. This program, written in Java, can be used to superimpose structures and derive sequence alignments from the superposition, align sequences from scratch or to a profile derived from a (possibly structure-based) sequence alignment, derive HMMs from such alignments, identify short sequence patterns in them, etc. Sequence alignments can also be imported from a range of other programs, and they can be edited, coloured, decorated and printed with the program.
- The Uppsala Electron Density Server, EDS, provides access to electron-density maps for more than 10,000 crystal structures in the PDB. In addition to the maps, several validation statistics are provided for every entry. It is hoped that this server will help to increase the appreciation of non-crystallographers for the varying quality (accuracy and precision) of the macromolecular crystal structures available from the PDB.

4. Corà D, Herrmann C, Dieterich C, Di Cunto F, Provero P, Caselle M
Identification of human transcription factor binding sites by comparative genomics.
Meeting: BITS 2004 - Year: 2004
Topic: Comparative genomics

Abstract: Understanding transcriptional regulation of gene expression is one of the greatest challenges of modern molecular biology. A central role in this mechanism is played by transcription factors (TF) which typically bind to specific, short DNA sequence motifs which are usually located in the upstream region of the regulated genes. We discuss here a simple and powerful approach for the identification of these cis-regulatory motifs based on human-mouse genomic comparison. By using the catalogue of conserved upstream sequences collected in the CORG database [1] we construct sets of genes sharing the same overrepresented motif in their upstream regions both in human and in mouse. We perform this construction for all possible words from 5 to 8 nucleotides in length and then filter the resulting sets looking for two types of evidence for coregulation: first, we analyse the Gene Ontology annotation of the genes in the set looking for statistically significant common annotation; second, we analyse the expression profiles of the genes in the set as measured by microarray experiments, looking for evidence of coexpression. The sets which pass one or both these filters are conjectured to contain a significant fraction of coregulated genes, and the upstream motifs characterizing the sets are thus good candidates to be the binding sites of the TF's involved in such regulation. In this way we find various known motifs (which we use to validate our approach) and also some new candidate binding sites.
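
The set-construction step described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: overrepresentation is approximated by a hypothetical raw-count threshold (`min_count`) rather than a statistical test, the function names are invented, and the input dictionaries stand in for the conserved upstream sequences taken from CORG.

```python
from collections import Counter, defaultdict


def word_counts(seq, k_min=5, k_max=8):
    """Count every word of length k_min..k_max (overlaps included)."""
    seq = seq.upper()
    counts = Counter()
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):   # skip words with ambiguous bases
                counts[word] += 1
    return counts


def shared_word_sets(human, mouse, min_count=2):
    """Map each word to the genes where it is frequent in BOTH species.

    human, mouse: dict of gene name -> upstream sequence.
    min_count is an assumed stand-in for a real overrepresentation test.
    """
    gene_sets = defaultdict(set)
    for gene in human.keys() & mouse.keys():
        h_counts = word_counts(human[gene])
        m_counts = word_counts(mouse[gene])
        for word, n in h_counts.items():
            if n >= min_count and m_counts.get(word, 0) >= min_count:
                gene_sets[word].add(gene)
    return dict(gene_sets)


# Toy data: ACGTA occurs twice upstream of g1 in both species.
human = {"g1": "ACGTAACGTA", "g2": "TTTTTTTT"}
mouse = {"g1": "GGACGTACCACGTA", "g2": "CCCCCCCC"}
print(shared_word_sets(human, mouse))
```

Each resulting gene set would then be passed to the two filters named in the abstract (shared Gene Ontology annotation and coexpression), which are not shown here.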

5. Ambesi-Impiombato A, Di Bernardo D
Novel Computational Method for Human Cis Regulatory Elements Prediction
Meeting: BITS 2004 - Year: 2004
Topic: Comparative genomics

Abstract: Introduction: Biological mechanisms underlying the regulation of gene expression are not completely understood. It is known that they involve binding of transcription factors to regulatory elements on gene promoters. However, attempts to computationally predict such elements in DNA sequences of gene promoters typically yield an excess of false positives. Computational identification of cis-regulatory elements (CREs) is currently based mainly on three different approaches: (1) identification of conserved motifs using interspecies sequence global alignments (Pennacchio 2001); (2) identification of conserved motifs in the promoters of coregulated genes (Hughes et al 2000, Sudarsanam et al 2002, Bussemaker et al 2001, Eskin et al 2002, Bailey et al 1994, Fujibuchi et al 2001, Palin et al 2002); (3) computational detection of known experimentally identified motifs in genes’ promoters for which binding factors are unknown (Kel et al 2003). The limitations of the first approach are caused by the high mutation, deletion and insertion rates in gene promoter regions (Ludwig 2002), which prevent a correct alignment of the promoter region. As experimental data accumulate on known DNA binding elements, an increasing amount of information can be used to search for similar elements in genes for which transcription factors are unknown. Our approach involves a consensus pattern search for known regulatory elements in the 5 kb upstream of the gene transcription start site against a background word distribution simulated by shuffling the symbols in the consensus. The aim is to minimize false positives by using a background model of random matches of experimentally determined consensus sequences, and by integrating information from the promoters of orthologous genes.
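
One way to picture the shuffled-consensus background is the sketch below. It is an assumption-laden illustration rather than the published method: the function names, the empirical p-value formula, the forward-strand-only search and the toy sequence are all choices made for this example, and the integration of orthologous promoters is left out.

```python
import random
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def count_matches(consensus, seq):
    """Count (possibly overlapping) forward-strand matches of a consensus."""
    pattern = "".join(IUPAC[symbol] for symbol in consensus.upper())
    # A zero-width lookahead lets overlapping matches be counted.
    return len(re.findall("(?=" + pattern + ")", seq.upper()))


def empirical_p(consensus, seq, n_shuffles=1000, seed=0):
    """Compare the observed count with counts for shuffled consensi.

    Returns (observed_count, p), where p is the add-one empirical fraction
    of shuffles that match at least as often as the real consensus.
    """
    observed = count_matches(consensus, seq)
    rng = random.Random(seed)
    symbols = list(consensus.upper())
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(symbols)
        if count_matches("".join(symbols), seq) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)


observed, p = empirical_p("TGAYR", "TGACGTTGATA", n_shuffles=200)
print(observed, p)
```

Shuffling keeps the base composition and degeneracy of the consensus fixed, so the background reflects how often a pattern of that composition matches by chance in the same promoter.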

6. Sironi M, Riva L, Menozzi G, Pozzoli U
Silencer elements as possible inhibitors of pseudoexon splicing
Meeting: BITS 2004 - Year: 2004
Topic: Comparative genomics

Abstract: Introduction: Production of functional mRNAs in eukaryotic organisms is critically dependent upon the accuracy of pre-mRNA splicing. The presence of well-defined cis-elements, namely the 5’ and 3’ splice sites and the branch point, is necessary but not sufficient to define intron-exon boundaries (1). Sequences within exon bodies have a prominent role in promoting exon definition; the best understood exonic elements are exonic splicing enhancers (ESEs), which represent binding sites for SR proteins (2). Sequences that act as exonic splicing silencers (ESSs) have also been described but are less well characterized than ESEs. It has been reported that pseudoexons (i.e. intronic sequences displaying good 3’ and 5’ splice sites) outnumber real exons by an order of magnitude (3). Recent observations (4, 5) suggest that a subpopulation of pseudoexons might exist in the human genome requiring only subtle changes to become splicing competent. Here we have applied a biocomputational approach to address the question of why pseudoexons are ignored and to identify putative splicing repressor elements.

7. Pavesi G, Mauri G, Pesole G
Weeder Web: a Web-Based Tool for the Discovery of Transcription Factor Binding Sites
Meeting: BITS 2004 - Year: 2004
Topic: Comparative genomics

Abstract: Understanding the complex mechanisms governing basic biological processes requires the characterization of regulatory motifs modulating gene expression at the transcriptional and post-transcriptional levels. In particular, the extent, chronology and cell-specificity of transcription are modulated by the interaction of transcription factors (TFs) with their corresponding binding sites (TFBSs), located in the promoter regions of the genes. The ever growing amount of genomic data, complemented by other sources of information concerning gene expression, opens new opportunities to researchers. Transcription factor binding sites are generally short (less than 12-14 bp long) and degenerate oligonucleotides, and this fact makes their computational discovery and large-scale annotation significantly harder. Hence, there is a need for efficient and reliable methods for detecting novel motifs that are significantly over-represented in the regulatory regions of sets of genes sharing common properties (e.g. similar expression profile, biological function, product cellular localization, etc.), and that in turn could represent binding sites for some common TF regulating the genes. We present here a Web server that provides access to a previously developed enumerative pattern discovery method [1] that is able to carry out an exhaustive search of significantly conserved degenerate oligonucleotide patterns with remarkable computational efficiency. Also, the interface has been designed in order to avoid the explicit definition of a large number of parameters that were included in the original general-case implementation of the algorithm, as well as to produce a simpler “user-friendly” output. The parameters have been set to default values suitable for capturing TFBSs. The interface Web address is: http://www.pesolelab.it:8080/weederWeb

8. Cruz P, Maselli V, Sanges R, Stupka E
CODE: Comparative Genomics of Disease Genes.
Meeting: BITS 2004 - Year: 2004
Topic: Database annotation and data mining

Abstract: The CODE project aims to provide a comparative genomics analysis of curated disease gene families. It integrates experimental and human-verified information into automated gene-centric pipelines, which regularly map disease genes and related features across available sequenced metazoan genomes. Of particular interest to the outcome of the project is the semi-automated annotation of non-coding sequences (ncRNA, promoters, enhancers and splice regulators). Considerable attention is paid to the evolutionary clues provided by the analysis, in particular when model animals are concerned. Finally, the establishment of a community portal, complementary to the existing international projects, will disseminate the results of the research and augment the annotation of disease genes.

9. Masseroli M, Martucci D, Pinciroli F
Genome dynamic and statistical functional annotations for biological knowledge mining from microarray data
Meeting: BITS 2004 - Year: 2004
Topic: Database annotation and data mining

Abstract: Statistical and clustering analyses of gene expression results from high-throughput microarray experiments produce lists of hundreds of candidate regulated genes, or genes with particular expression profile patterns, in the conditions under study. Independently of the microarray platforms and analysis methods used to identify and cluster differentially expressed genes, the common task any researcher faces is to translate the identified lists of genes into a better understanding of the patho-physiological phenomena involved. To this aim, many biological annotations are available within numerous heterogeneous and widely distributed databases. Although several tools have been developed for annotating lists of genes, most of them do not provide methods to evaluate the relevance of the retrieved annotations for the considered set of genes, or to estimate the functional bias introduced by the gene set present on the specific array used to identify the considered gene list. Lately, a few tools have been proposed that use gene annotations provided through the Gene Ontology (GO) [1] controlled vocabularies to enrich lists of genes with biological information. Some of them (e.g. Affymetrix Data Mining Tool, DAVID, FatiGO, GoMiner, MAPPFinder) also present the GO categories most relevant for a given set of genes according to the number of genes of the considered set belonging to a given category, or in relation to their statistical evaluation performed using some basic tests. To extend these functionalities we created GFINDer (i.e. Genome Function INtegrated Discoverer, http://www.medinfopoli.polimi.it/GFINDer/), a web server able to automatically provide large-scale lists of user-classified genes with the statistically significant functional profiles that biologically characterize the different gene classes in a considered gene list.
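
The abstract does not say which statistical tests are used, so the sketch below shows only one common choice for scoring a GO category against a gene list: a one-sided hypergeometric test, with the genes on the array as the reference population so that the array's own functional bias is accounted for. The function name, parameter names and numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, drawn, observed):
    """P(X >= observed) for a hypergeometric draw without replacement.

    population: genes on the array (the reference set)
    annotated:  genes on the array carrying the GO category
    drawn:      genes in the list under study
    observed:   genes in the list carrying the GO category
    """
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(population, drawn)


# Toy numbers: 12 of 100 listed genes carry a category that
# annotates 300 of the 10000 genes on the array.
print(hypergeom_pvalue(10000, 300, 100, 12))
```

Using the array content rather than the whole genome as the population is what separates enrichment in the gene list from enrichment that is already built into the array design.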

10. Guffanti A, Luzi L, Confalonieri S, Trubia M, Volorio S, Graziani S, Pelicci PG, Di Fiore PP
A bioinformatic strategy for large-scale identification and annotation of chromosomal aberrations in tumors
Meeting: BITS 2004 - Year: 2004
Topic: Database annotation and data mining

Abstract: We describe here the rationale, implementation and results of a bioinformatic strategy for large-scale identification and annotation of chromosomal translocations in tumours, based on sequence and annotation comparison between the human transcriptome and EST partial cDNA sequences derived from tissues or cell lines. We also illustrate how the sequencing and subsequent careful bioinformatic analysis of a number of identified candidate translocation cDNAs revealed the complexity of distinguishing recombination from true translocation events. Finally, we suggest some EST filtering and cleaning strategies for pursuing EST-based “in silico” translocation identification projects.

11. Tasco GL, Montanucci L, Fariselli P, Martelli PL, Marani P, Casadio R
Protein structures and thermostability
Meeting: BITS 2004 - Year: 2004
Topic: Structural genomics

Abstract: What is thermostability? This question is still unanswered in spite of several studies aiming at the determination of typical features of thermostable proteins (for a recent review see [1]). We tackled the problem considering a large set of proteins from thermophilic and hyperthermophilic organisms available in the PDB with atomic resolution. A PDB-derived database was generated containing proteins from thermophiles and their counterparts from mesophiles, with the specific constraint of sequence identity >30% and difference in sequence length <20%. By this, 128 proteins from thermophiles were compared to 109 structures from mesophiles with a root mean square deviation <0.29 nm. Residue composition, secondary structure, length of secondary structure motifs, hydrogen bonds, salt bridges and composition of solvent accessible surface were evaluated with specifically developed programs in both sets in order to perform a statistical analysis. The results of our investigation are as follows: proteins from thermophiles are endowed with more charged residues, particularly in the exposed surfaces, and with more salt bridges, which are more accessible on average than those in proteins from mesophiles. However, neither the content of secondary structure nor the length of secondary structure motifs was significantly different. Taken together, these data suggest that thermostable proteins, as compared to their mesophilic counterparts, are endowed with more electrostatic interactions, particularly on the protein surface, to stabilize more water dipoles and compensate for thermal motion at high temperatures.
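
The pairing constraint stated above (sequence identity >30%, difference in sequence length <20%) can be written as a small filter. The sketch assumes the length difference is measured relative to the longer sequence, which the abstract does not specify; the function name is hypothetical.

```python
def is_comparable_pair(identity_pct, len_thermo, len_meso):
    """Check the identity and length constraints for a thermophile/mesophile pair.

    Assumption: the length difference is taken relative to the longer
    of the two sequences.
    """
    length_diff = abs(len_thermo - len_meso) / max(len_thermo, len_meso)
    return identity_pct > 30 and length_diff < 0.20


# 35% identity, lengths 200 and 180 (10% difference): the pair is kept.
print(is_comparable_pair(35.0, 200, 180))
```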

12. Passerini A, Frasconi P
Learning to discriminate between ligand bound and disulfide bound cysteines
Meeting: BITS 2004 - Year: 2004
Topic: Structural genomics

Abstract: Non-free cysteines that are not involved in the formation of disulfide bridges are very often bound to prosthetic groups that include a metal ion and that play an important role in the function of a protein. The discrimination between the presence of a disulfide bridge (DB) or a metal binding site (MBS) at a bound cysteine is often a necessary step during the NMR spectral assignment process of metalloproteins, and its automation may significantly help towards speeding up the overall process. Several proteins are known where both situations are in principle plausible and it is not always possible to assign a precise function to each cysteine (see e.g. [2,1,5]). We formulate the prediction task as a binary classification problem: given a non-free cysteine and information about flanking residues, predict whether the cysteine can bind to a prosthetic group containing a metal ion (positive class) or whether it is always bound to another cysteine forming a disulfide bridge (negative class). Firstly, we suggest a nontrivial baseline predictor based on PROSITE pattern hits. Secondly, we introduce a classifier fed by multiple alignment profiles and based on support vector machines (SVM) [3]. We show that the latter classifier is capable of discovering the large majority of the relevant PROSITE patterns, but is also sensitive to signal in the profile sequence that cannot be detected by regular expressions and therefore outperforms the baseline predictor.
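
To make the idea of a baseline built on PROSITE pattern hits concrete, the sketch below translates a PROSITE-style pattern into a regular expression and reports which cysteines fall inside a hit. It is an illustration only: the function names, the pattern and the sequence are invented, the translator covers only the common syntax elements, and the authors' actual baseline may differ.

```python
import re

# One pattern element: x, a residue, [alternatives] or {exclusions},
# optionally followed by a repeat count (n) or range (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def cysteines_in_hits(sequence, pattern):
    """Return 0-based positions of cysteines covered by a pattern hit."""
    regex = re.compile(prosite_to_regex(pattern))
    covered = set()
    for hit in regex.finditer(sequence):   # non-overlapping hits only
        covered.update(i for i in range(hit.start(), hit.end())
                       if sequence[i] == "C")
    return sorted(covered)


# Hypothetical pattern and sequence, for illustration only.
print(prosite_to_regex("C-x(2)-C-{P}-[ST]"))
print(cysteines_in_hits("AACGHCASK", "C-x(2)-C-{P}-[ST]"))
```

A regular expression of this kind sees only exact residue classes at fixed spacings, which is the limitation the abstract points to when it says the profile-based SVM picks up signal that patterns cannot detect.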

13. Lexa M, Valle G
Combining rapid word searches with segment-to-segment alignment for sensitive similarity detection, domain identification and structural modelling.
Meeting: BITS 2004 - Year: 2004
Topic: Structural genomics

Abstract: The most popular alignment and similarity search techniques are based on the classical Smith-Waterman scoring scheme. Conservation of a single structural or functional feature between proteins may be undetectable, because the similarities tend to persist only in the key areas, consisting of residues dispersed in a non-trivial manner. We propose a novel method that finds occurrences of short similar words common to the studied sequences and handles the identified matches in a manner similar to segment-to-segment alignment [2]. Our interest in this area stems from the development of programs for fast searches with mismatches in large biological databases [1]. As shown here, these programs can support large database searches that lead to automatic domain detection and sequence annotation. The use of this technique in fold-recognition and structure prediction is being studied.

14. Ferrè F, Ausiello G, Zanzoni A, Helmer-Citterich M
Large scale surface comparison for the identification of functional similarities in unrelated proteins
Meeting: BITS 2004 - Year: 2004
Topic: Structural genomics

Abstract: We developed a systematic large-scale approach to identifying protein surface regions sharing shape and residue similarity. We used a new fast structural comparison algorithm (LSC: Local Structure Comparison) to exhaustively compare a set of functionally annotated protein patches with a larger collection of protein cavities. From a dataset of about 10,000 protein surface patches extracted from a non-redundant list of PDB proteins (p-value = 10^-7), we collected a grand total of 65910 matches among patch pairs, which were stored in the SURFACE database. The functional meaning of most of the matches could be confirmed by other established methods: the presence of the same PROSITE and ELM motifs in the sequence, the presence of the same ligand in the PDB structure, similar GO terms, common SWISS-PROT keywords, sequence similarity, same SCOP superfamily and E.C. numbers. We noticed that the fraction of matches whose functional association can be confirmed by more methods decreases appreciably with the extension of the match.

15. Greco C, Sacco E, Vanoni M, De Gioia L
Structural determinants of the regulatory action exerted by the aminoterminal region of hSos1 on the Ras-GEF activity
Meeting: BITS 2004 - Year: 2004
Topic: Structural genomics

Abstract: The information carried by the amino acid sequence can be used by different bioinformatic methods in order to predict the 3D structure of a protein or its domains. Sequence alignment tools make it possible to identify homologous regions among proteins, and this represents the basis for a homology modeling procedure. Secondary structure prediction algorithms use chemical, physical and statistical parameters to recognize whether a region of sequence could assume a specific secondary structure. Fold recognition servers can test whether a protein sequence is compatible with one of the known folds in the PDB. If these different tools give consistent responses, it is possible to predict with good reliability the fold of a protein or of single domains of unknown structure. hSos1 is a multidomain protein involved in the activation of Ras signaling by catalyzing guanine nucleotide exchange on Ras. The Ras-GEF domain of hSos1 (Sos-Cat) is flanked by amino- and carboxyl-terminal regions, which are able to inhibit hSos1 activity towards Ras. To investigate the structural determinants of this inhibition, it is necessary to know the structural features of the domains involved. The carboxyl terminus of hSos1 contains a proline-rich domain with consensus sequences for binding to SH3 domains, while the amino-terminal region of hSos1 includes three domains: the Histone domain, the Dbl Homology domain (DH) and the Pleckstrin homology domain (PH). The Histone domain is involved in the inhibition of the Ras-GEF activity of hSos1. It can also bind the PH domain, while it cannot interact with the DH domain. The DH domain is implicated in the inhibition of the Ras-GEF activity of hSos1, possibly through direct interaction with Sos-Cat. The PH domain is able to interact with the DH domain; the crystal structure of the PH-DH complex is available.
We have focused on the intra-molecular interactions that occur among these domains in the activation/inhibition of hSos1 by means of computational tools, such as low-resolution protein-protein docking. The essence of the procedure is the reduction of protein structures to digitized images on a three-dimensional grid. Structural elements smaller than the step of the grid (e.g., atom-size) are not present in the docking. This feature reduces the negative effect that structural changes upon complex formation have on the docking calculation.
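
The grid reduction described above can be illustrated with a few lines of code. This is a toy sketch, not the docking program used by the authors: the function name, the coordinates and the grid steps are hypothetical, and coordinates are assumed to be in ångström.

```python
def digitize(atoms, step):
    """Reduce atom coordinates to the set of occupied cubic grid cells.

    atoms: iterable of (x, y, z) coordinates
    step:  edge length of a grid cell, in the same units as the coordinates
    """
    return {tuple(int(c // step) for c in xyz) for xyz in atoms}


atoms = [(1.2, 0.5, 0.5), (3.9, 0.5, 0.5), (7.5, 0.5, 0.5)]
print(len(digitize(atoms, 1.0)))   # fine grid: each atom has its own cell
print(len(digitize(atoms, 5.0)))   # coarse grid: nearby atoms merge
```

With the coarse step the first two atoms fall into the same cell, so a small displacement of either one leaves the digitized image unchanged. This is the sense in which features smaller than the grid step drop out of the docking.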

16. Passamano M, D'Agostino N, Caprera A, Milanesi L
Comprehensive Analysis of Protein Kinase Domains
Meeting: BITS 2004 - Year: 2004
Topic: Structural genomics

Abstract: Eukaryotic protein kinases (ePKs) constitute one of the largest recognized protein families represented in the human genome and are important players in virtually every signaling pathway involved in normal development and disease. The key feature that distinguishes ePK superfamily members from other proteins is the sequence of a contiguous stretch of approximately 250 amino acids that constitutes the catalytic domain [1-2]. Around half of the human kinases contain other domains in addition to the catalytic domain, which are often involved in kinase regulation, interactions with other partners or subcellular localization [3]. Domains are one of the most useful levels at which to understand protein function and to perform domain family-based analysis, so we developed an automated analysis system for studying the statistical distribution of domains in the kinase superfamily.

17. Costantini S, Colonna G, Facchiano AM
Comparative modelling for predicting the different conformations assumed by a protein during its different activities
Meeting: BITS 2004 - Year: 2004
Topic: Structural genomics

Abstract: Knowledge of the structural organization of proteins is crucial in understanding their role in the cell and the related molecular mechanisms. Comparative modelling has already become one of the most effective computational approaches in facilitating structural/functional characterization of many protein-coding sequences across genomes, and it is based on the assumption that homologous proteins adopt the same fold to have the same function. On this basis, it is possible to model the 3D structure of a protein if at least one experimental model of a homologous protein is known. However, it is evident that a large number of proteins, probably all, may assume different conformations depending on the environmental conditions or the interaction with other molecules. Conformational modifications occur when a protein changes its monomeric/oligomeric state, enzymes adapt their conformation to the substrate when it is recognized, and very different secondary structures are observed in the normal and pathological forms of the prion protein. We are interested in applying comparative modelling to predict the different conformations assumed by a protein to exert its biological activities or in different environmental conditions. In this work we applied comparative modelling methods to create models of interleukin 1beta (IL-1beta) and to investigate the conformational changes occurring when this protein interacts with its receptor (IL-1R).

18. Leo P, Marinelli C, Pappadà G, Scioscia G, Zanchetta L
BioWBI: an Integrated Tool for building and executing Bioinformatic Analysis Workflows
Meeting: BITS 2004 - Year: 2004
Topic: Computer algorithms and applications

Abstract: Building integrated bioinformatic platforms is one of the most challenging tasks that the bioinformatics community has been dealing with in recent years [1-2]. In facing this task, a number of specific problems arise, connected to data integration and to the integration of specialized tools and algorithms. The solution described in this paper is a step towards meeting this challenge. It is characterized by two original assumptions: 1) a quite sharp division between the data realm of a bioinformatics analysis and its components in terms of algorithms and processes, 2) the conception of a rigorous algebra that allows researchers to formalize their analyses in terms of atomic process workflows. As a result of this approach two bioinformatics web tools, BioWBI and WEE, have been designed and prototyped by our group to provide researchers with a virtual collaborative workspace in which they can define their data sources and graphically draw and execute analysis workflows. These tools constitute the basic components of a much more general bioinformatic e-workplace.

19. Merelli E, Romano P, Scortichini L
A Workflow Service for Biomedical Application
Meeting: BITS 2004 - Year: 2004
Topic: Computer algorithms and applications

Abstract: The proposed work has been developed in the context of the O2I (Oncology over Internet) project. O2I aims to develop and prototype an integrated platform to support biomedical and clinical research in retrieving both structured and textual information from the Internet and integrating it in a standard format. Biomedical researchers usually interact with the Web step by step to query, select and integrate information; in their daily work, bioscientists would benefit from a powerful tool able to execute queries consisting of several interrelated activities. In this scenario, the biomedical research process can be formulated as a workflow of activities, whose execution must be supported by suitable middleware. We propose a workflow service agent that supports bioscientists in creating their own workflows and also monitors their execution. In particular, in the O2I context, we are experimenting with BioAgent, an agent-based middleware developed at Camerino University; the middleware can be configured by plugging in agent-services to support the integration of tools and services for a specific domain.

20. Mishra B, Policriti A
Systems Biology, Automata, and Languages
Meeting: BITS 2004 - Year: 2004
Topic: Computer algorithms and applications

Abstract: The central theme of our work is the problem of formulating a “unitary step” that defines how a complex biological system makes a transition from one “state” or one “control mode” to another, as well as the conditions under which such transitions are enabled. We recognize that automata (either discrete or hybrid, that is, capable of modeling mixed discrete/continuous behaviour) based on the formulation of these unitary steps can elegantly model biological control mechanisms, allow us to reason about such mechanisms in a modal logic system with modes constructed over a next-time operator, and can become the foundational framework for the emerging field of systems biology. These models can lead to more rigorous algorithmic analysis of the large amounts of biological data produced as (numerical) traces of in vivo, in vitro and in silico experiments, currently a central activity for many biologists and biochemists. Since modeling biological systems requires careful consideration of both qualitative and quantitative aspects, our automata-based tools can effectively assist working biologists in making predictions, generating falsifiable hypotheses and designing well-focused experiments, activities in which the time dimension and a properly designed query language cannot be left out of consideration. Thus, ultimately, the aim of our work is to elucidate the role played by automata in modeling biological systems and to investigate the potential of such tools when combined with the more “classical” approaches used in the past to devise models and experiments in biology. Our discussion here is based primarily on our experience with a novel system that we introduced recently (called XS-systems) and used to implement algorithms and software tools (Simpathica). These conceptual tools have been integrated with prototype implementations, and are currently undergoing a growing set of enhancements and optimizations.
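As a rough illustration of the “unitary step” idea, the sketch below replays a numerical trace through a small automaton whose transitions are enabled by guard conditions. It is not the XS-systems or Simpathica implementation; the mode names, the variable `x` and the thresholds are hypothetical and chosen only for the example.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
# A transition is (source mode, guard on the measured state, target mode).
Transition = Tuple[str, Callable[[State], bool], str]


def step(mode: str, state: State, transitions: List[Transition]) -> str:
    """One unitary step: move to the first enabled target mode, else stay."""
    for source, guard, target in transitions:
        if source == mode and guard(state):
            return target
    return mode


def run(mode: str, trace: List[State], transitions: List[Transition]) -> List[str]:
    """Replay a numerical trace and record the mode after each sample."""
    modes = []
    for state in trace:
        mode = step(mode, state, transitions)
        modes.append(mode)
    return modes


# Hypothetical two-mode system driven by a single concentration x.
TRANSITIONS: List[Transition] = [
    ("low", lambda s: s["x"] > 1.0, "high"),
    ("high", lambda s: s["x"] < 0.5, "low"),
]
TRACE = [{"x": 0.2}, {"x": 1.2}, {"x": 0.8}, {"x": 0.4}]
MODES = run("low", TRACE, TRANSITIONS)
```

The two guards use different thresholds, so the mode at `x = 0.8` depends on the history of the trace, which is the kind of time-dependent behaviour the abstract argues a query language must be able to express.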

21. Bultrini E, Pizzi E
Linguistic analysis of promoter regions in eukaryotic genomes
Meeting: BITS 2004 - Year: 2004
Topic: Computer algorithms and applications

Abstract: Promoter recognition is one of the most difficult tasks in annotating eukaryotic genomes. Binding sites for transcription factors are very short sequences (5-15 bp) and not very well conserved in sequence. In addition, other signals can be associated with a regulatory region. For instance, in vertebrates some classes of promoters are associated with compositionally characterised regions (CpG islands), and there is also evidence that the molecular conformation of human promoters is involved in transcription activity [1, 2]. Following a previous investigation [3, 4], in the present work we propose a new procedure, based on well-established statistical methods, to extract a set of oligonucleotides specifically characterising intron sequences. Partitioning of genomic sequences according to their accordance with the extracted “introns’ vocabulary” reveals that intergenic DNA appears as a patchwork of different elements. The majority of them adopt the “introns’ vocabulary”, whereas a small percentage do not. We hypothesise that the identified linguistic property is a sort of “background noise” of a genome; in this perspective, regions that play a functional and/or structural role probably have to emerge from the background by adopting specific compositional properties. The analysis of promoter sequences for the four examined genomes (C. elegans, D. melanogaster, M. musculus, H. sapiens) appears to confirm our hypothesis, as regions immediately surrounding the transcription start site deviate from the introns’ vocabulary usage. Furthermore, analyses of C+G composition, bendability propensity and torsional rigidity of promoter sequences are presented.
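The vocabulary idea can be illustrated with a minimal sketch. The abstract does not specify the statistical method, so a plain frequency ratio against a background set (with a pseudocount) is used here as a stand-in; the function names, the `min_ratio` cut-off and the toy sequences are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Set


def kmer_counts(seqs: Iterable[str], k: int) -> Counter:
    """Count all overlapping k-mers in a collection of sequences."""
    counts: Counter = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def vocabulary(introns, background, k: int, min_ratio: float = 2.0) -> Set[str]:
    """k-mers at least min_ratio times more frequent in introns than in background.

    Assumption: a simple frequency ratio replaces the authors' statistical test.
    """
    fg = kmer_counts(introns, k)
    bg = kmer_counts(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab = set()
    for kmer, n in fg.items():
        fg_freq = n / fg_total
        bg_freq = (bg[kmer] + 1) / (bg_total + 1)  # pseudocount avoids division by zero
        if fg_freq / bg_freq >= min_ratio:
            vocab.add(kmer)
    return vocab


def accordance(window: str, vocab: Set[str], k: int) -> float:
    """Fraction of the k-mers in a window that belong to the vocabulary."""
    total = len(window) - k + 1
    if total <= 0:
        return 0.0
    hits = sum(1 for i in range(total) if window[i:i + k] in vocab)
    return hits / total


# Toy data, not real genomic sequence.
VOCAB = vocabulary(["ATATATAT", "TATATATA"], ["GCGCGCGC", "ATGCATGC"], k=2)
```

Sliding `accordance` along a genomic sequence would give the kind of partition described above: windows with high values follow the intron-like vocabulary, while windows with low values deviate from it.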

22. Pardo M, Sberveglieri G, Wold B
Yet another Feature Selection Study for Microarrays
Meeting: BITS 2004 - Year: 2004
Topic: Microarray algorithms and data analysis

Abstract: Two histopathologically different kinds of rhabdomyosarcoma (RMS), alveolar and embryonal RMS, are associated with distinct clinical characteristics and different cytogenetic properties. Affymetrix microarrays (U133A/B) were used to characterize 74 tumoral tissues of both kinds. For consistency with previous work, 8801 genes have been considered in our analysis, and the train/test division was fixed at 56 training and 18 test samples. Feature Selection (FS) is useful both for enhancing classification performance and, more importantly, for discovering biologically relevant genes. FS is therefore a hot topic in the application of machine learning to the analysis of microarray data.

23. Fu L-M, Medico E
FMC, a Fuzzy Map Clustering algorithm for microarray data analysis
Meeting: BITS 2004 - Year: 2004
Topic: Microarray algorithms and data analysis

Abstract: As microarray technology emerges as a widely used tool to investigate gene expression and function, laboratories all over the world have produced, and are still producing, a huge amount of data that demands advanced and specialized computational tools. Clustering methods have been successfully applied to reorganize such data and extract biological information from them. However, classical clustering methods [1] such as k-means and hierarchical clustering have some intrinsic limits, such as the linear, pair-wise nature of the similarity metrics (which fail to highlight non-linear substructures of the data) and the univocal assignment of each gene to one cluster (which may fail to highlight cluster-to-cluster relationships) [2]. Here we introduce a novel method for clustering microarray data, named Fuzzy Map Clustering (FMC), which may partly overcome these limits. The clustering process of FMC starts by identifying an initial set of clusters: the “density” around each data point (object) is calculated as the average proximity of its K nearest other objects (K neighbours), and the objects that have the highest density among all their K neighbours are chosen. K can be a fixed number chosen by the user or the number of neighbours within a distance threshold. Then, each object in the dataset is assigned a fuzzy membership to all the defined clusters (a vector containing a percentage of membership to each cluster). Membership is assigned so that similar objects have similar fuzzy membership vectors. Membership assignment is optimized by measuring how well the fuzzy membership vector of one object can be approximated by the vectors of its neighbours. Finally, a process based on the merging of adjacent clusters and fuzzy membership reassignment is iterated until the number of clusters is reduced to a fixed number decided by the operator.
Our computational experiments have shown that FMC can correctly reveal the true cluster structure of the dataset if such a structure exists, even if the clusters contained in the dataset have arbitrary shape. The basic idea underlying FMC may also point to a new way of developing novel clustering methods with a good mathematical foundation.
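The first step of the procedure (density estimation and choice of the initial clusters) can be sketched as follows. This covers only the seeding step, not the fuzzy membership assignment or cluster merging, and it is not the authors' code: the proximity measure `1 / (1 + distance)`, the Euclidean metric and the toy points are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def knn(points: List[Point], i: int, k: int) -> List[int]:
    """Indices of the k nearest other points to points[i] (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def densities(points: List[Point], k: int) -> Tuple[List[float], List[List[int]]]:
    """Density of each point: average proximity of its k nearest neighbours.

    Assumption: proximity is taken as 1 / (1 + distance).
    """
    dens, neigh = [], []
    for i in range(len(points)):
        nb = knn(points, i, k)
        prox = [1.0 / (1.0 + math.dist(points[i], points[j])) for j in nb]
        dens.append(sum(prox) / len(prox))
        neigh.append(nb)
    return dens, neigh


def initial_seeds(points: List[Point], k: int) -> List[int]:
    """Points whose density is the highest among all their k neighbours."""
    dens, neigh = densities(points, k)
    return [i for i in range(len(points))
            if all(dens[i] >= dens[j] for j in neigh[i])]


# Two well-separated toy groups; indices 3 and 7 sit nearest the middle of each.
POINTS = [(0, 0), (0, 1), (1, 0), (0.4, 0.4),
          (10, 10), (10, 11), (11, 10), (10.4, 10.4)]
SEEDS = initial_seeds(POINTS, k=3)
```

With `k=3` the two central points are returned as seeds, one per group. Ties in density would produce more than one seed per group under the `>=` comparison, which a fuller implementation would need to resolve.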

24. Masulli F, Rovetta S
Ensembling and Clustering Approach to Gene Selection
Meeting: BITS 2004 - Year: 2004
Topic: Microarray algorithms and data analysis

Abstract: In pattern recognition, the problem of input variable selection has traditionally focused on technological issues, e.g., performance enhancement, lowering computational requirements, and reduction of data acquisition costs. In the last few years, however, it has found many applications in basic science as a model selection and discovery technique, as shown by a rich literature on this subject, which testifies to the interest in the topic, especially in the field of bioinformatics. A clear example arises from DNA microarray technology, which provides high volumes of data for each single experiment, yielding measurements for hundreds of genes simultaneously. In this paper, we propose a flexible method for analyzing the relevance of input variables in high-dimensional problems with respect to a given dichotomic classification problem. Both linear and non-linear cases are considered. In the linear case, the application of derivative-based saliency yields a commonly adopted ranking criterion. In the non-linear case, the approach is extended by introducing a resampling technique and by clustering the obtained results for stability of the estimate. The method we propose (see Tab. 1) is termed Random Voronoi Ensemble, since it is based on random Voronoi partitions that are replicated by resampling, so that the method actually uses an ensemble of random Voronoi partitions. Within each Voronoi region, a linear classification is performed using Support Vector Machines (SVM) with a linear kernel, while, to integrate the outcomes of the ensemble, we use the Graded Possibilistic Clustering technique to ensure an appropriate level of outlier insensitivity.
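The partitioning step alone can be sketched as below; the per-region linear SVM and the Graded Possibilistic Clustering stage are deliberately left out. Drawing the Voronoi seeds from the data points themselves, the function names and the toy data are assumptions, since the abstract does not say how the random partitions are generated.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def random_voronoi_partition(points: List[Point], n_regions: int,
                             rng: random.Random) -> Tuple[List[Point], List[int]]:
    """Draw random seeds and assign every point to its nearest seed.

    Assumption: seeds are sampled without replacement from the data points.
    """
    seeds = rng.sample(points, n_regions)
    labels = [min(range(n_regions), key=lambda r: math.dist(p, seeds[r]))
              for p in points]
    return seeds, labels


def ensemble(points: List[Point], n_regions: int, n_replicates: int,
             seed: int = 0) -> List[Tuple[List[Point], List[int]]]:
    """Replicate the random partition to obtain an ensemble of partitions."""
    rng = random.Random(seed)
    return [random_voronoi_partition(points, n_regions, rng)
            for _ in range(n_replicates)]


# Toy data with distinct points.
POINTS = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
ENSEMBLE = ensemble(POINTS, n_regions=2, n_replicates=3, seed=0)
```

In the full method each region of each replicate would then receive its own linear classifier, and the resulting per-region variable rankings would be aggregated across the ensemble.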

25. Cordero F, Lazzarato F, De Bortoli M, Weisz A, Cicatiello L, Scafoglio C, Basile W, Calogero RA
Putative Estrogen-Responsive Genes database (PERG)
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Estrogens are known to regulate the proliferation of breast cancer cells and to alter their cytoarchitectural and phenotypic properties, but the gene networks and pathways by which estrogenic hormones regulate these events are only partially understood. As a starting point for obtaining a genome-wide picture of the genes modulated by estrogens, we have built a database of the genes that have an Estrogen-Responsive Element (ERE) in their putative promoter region.

26. Muselli M, Ruffino F, Valentini G
An Artificial Model for Validating Gene Selection Methods
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Every DNA microarray experiment provides thousands of real values that correspond to the gene expression levels of a tissue. This technology can offer a valuable new tool for medical diagnosis, since it can yield a reliable way to determine the state of a patient (e.g. healthy or ill) by measuring the gene expression levels of the patient's cells. The dataset obtained through several microarray experiments can be represented by a table with m rows and n columns: each row is associated with an examined tissue and each column corresponds to one of the considered genes. To specify a particular state for each tissue, a final column must be added to the table. Typically m ~ 100, while n ~ 10000. When analyzing this table to retrieve a model for diagnosis, we have two different targets: besides finding a method that recognizes the state pertaining to a specific tissue (discrimination), we wish to determine the genes involved in this prediction (gene selection). The quality of the discrimination task can be estimated simply through a measure of accuracy, obtained by proper methods (hold-out, cross validation, etc.). By contrast, it is very difficult to evaluate the results of the gene selection process, since the genes really involved in the onset of a state are unknown. A possible way of validating gene selection would be to analyze the performance of the considered method on a diagnosis problem where the significant genes are known. Unfortunately, no problem of this kind is available at present. An alternative approach consists in building an artificial model, starting from proper biological motivations, that generates data having the same statistical characteristics as the gene expression levels produced by microarray experiments. As proposed in [1], the behavior of a biological system can be described through regulatory networks that represent the interaction between different genes.
The nodes and the edges of these networks are ruled by dynamic equations that involve the concentration of products encoded by genes and consequently the gene expression levels. Each concentration is expressed through a real variable that changes with time and can determine the transition of the system from a state to another. When the organism is in a particular state some concentrations are lower than a given threshold (specific for each gene), while others exceed a proper value. Thus, if we select a definite state, we can say that a gene is in the active state, if its expression level has a value consistent (lower or greater than a specific threshold) with that state. With this definition each gene can be described by a binary variable, assuming value 1 if the gene is active and 0 otherwise. Also the presence of the considered state can be expressed through a Boolean variable, which takes the value 1, if the tissue is in that state, and 0 otherwise. Consequently, the whole biological system can be described by a Boolean function f with n inputs. Each of the m available microarray experiments corresponds to a particular entry of the truth table for the function f; it is formed by an input-output pair (x,y), where x is a vector of n binary values associated with the examined genes and y is a binary value asserting if the corresponding tissue is in the considered state or not. According to this setting, a technique to generate artificial data for validating gene selection methods consists in building a proper Boolean function f, whose truth table entries share the same statistical characteristics of gene expression levels produced by microarray experiments. Then, the quality of the gene selection method is measured by the percentage of significant genes retrieved. Although each Boolean function can be described by a logical expression containing only AND, OR and NOT operations, in our case it is more convenient to obtain f in a different way. 
In fact, it can be observed that in biological systems genes can be assembled into groups of expression signatures, i.e. subsets of coordinately expressed genes related to specific biological functions. These groups of genes are, in some sense, equivalent with respect to the state determination. Thus, the Boolean function f can be viewed as a combination of several groups of genes. Each group is considered active if a sufficiently large number of its genes is active. Then, the function f assumes value 1 if the number of active groups exceeds a given threshold. A proper algorithm for constructing Boolean functions with these characteristics has been implemented. It is able to generate data resembling those produced by several microarray experiments for diagnostic purpose. In these cases two or more different states are analyzed and the algorithm constructs a specific Boolean function (adopting the above approach) for each state. Then, to allow the application of the gene selection method, a set of input-output pairs is produced for each Boolean function built. The algorithm includes several parameters that can be tuned to achieve a good agreement between the resulting collection of input-output pairs and the dataset produced by microarray experiments for a specific problem. An evaluation of this agreement can be obtained by looking at the accuracy values scored by a discriminant method for different numbers of considered genes. In this contribution, the Leukemia dataset has been considered and a proper artificial model has been generated by constructing a specific Boolean function for each of the two variants of leukemia examined. Figure 1 shows the accuracy values obtained through the leave-one-out approach by applying the SVM-RFE method described in and the technique proposed in. As one can note, the agreement between the success rate curves is excellent in both situations.
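The group-based construction of the Boolean function f can be sketched as follows. The group composition, the per-group minimum counts and the group threshold are hypothetical values chosen for the example; the actual algorithm and its tunable parameters are not given in the abstract.

```python
from typing import List, Sequence


def group_active(x: Sequence[int], group: Sequence[int], min_active: int) -> bool:
    """A group is active if at least min_active of its genes are active (value 1)."""
    return sum(x[g] for g in group) >= min_active


def boolean_state(x: Sequence[int], groups: List[Sequence[int]],
                  min_active: Sequence[int], group_threshold: int) -> int:
    """f(x) = 1 if the number of active groups exceeds group_threshold, else 0."""
    n_active = sum(group_active(x, g, m) for g, m in zip(groups, min_active))
    return 1 if n_active > group_threshold else 0


# Hypothetical setting: 8 genes arranged in 3 expression signatures.
GROUPS = [[0, 1, 2], [3, 4], [5, 6, 7]]
MIN_ACTIVE = [2, 1, 2]      # genes needed to switch each group on
GROUP_THRESHOLD = 1         # more than one active group gives f = 1

X_POSITIVE = [1, 1, 0, 0, 0, 1, 1, 0]   # groups 0 and 2 are active
X_NEGATIVE = [1, 0, 0, 1, 0, 0, 0, 0]   # only group 1 is active

Y_POSITIVE = boolean_state(X_POSITIVE, GROUPS, MIN_ACTIVE, GROUP_THRESHOLD)
Y_NEGATIVE = boolean_state(X_NEGATIVE, GROUPS, MIN_ACTIVE, GROUP_THRESHOLD)
```

Each (x, y) pair produced this way plays the role of one artificial microarray experiment, and because the genes in `GROUPS` are known to be the significant ones, the fraction of them recovered by a gene selection method can be measured directly.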

27. Di Camillo B, Toffolo G, Cobelli C, Nair KS
Selection of Insulin Regulated Gene Expression Profiles Based on Intensity-Dependent Noise Distribution of Microarray Data
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Insulin resistance in skeletal muscle plays a key role in the development of Type 2 diabetes. To define the molecular mechanisms underlying insulin-induced changes in gene expression, recent microarray studies identified genes involved in insulin resistance in control vs diabetic subjects and before vs after insulin treatment, i.e. exploiting only steady-state information. Although extremely useful for identifying candidate genes involved in the analyzed processes and for developing new physiological hypotheses, these data can tell little about the interactions among genes. To infer gene regulation, it is of paramount importance to monitor dynamic expression profiles, i.e. time series of expression data collected during the transition from one physiological state to another. A necessary first step, in order to limit the analysis to those genes that actually change expression over time, is to select differentially expressed genes. Methods proposed in the literature usually deal with the comparison of static conditions rather than time-course data, and are based on modified t-tests and ANOVA tests, which assume a Gaussian distribution of the analyzed variables. These methods test the significance of differential expression gene by gene, and their application requires at least two replicated experiments for each condition. In time-course experiments, a number of samples are monitored across time and complete replicates of the experiment are seldom available, mainly for cost reasons. Therefore, differentially expressed genes are often selected using an empirical fold change (FC) threshold. This is far from ideal, since it is based on an arbitrary choice (e.g. FC=2). In the case of Affymetrix chips, this choice is even more questionable, since a constant threshold does not take into account the intensity dependence of the measurement errors, which is a well-known feature of this technology.
Here, we propose a novel method for gene selection, to be applied to dynamic gene expression profiles, which explicitly accounts for the properties of the measurement errors and addresses the situation where a relatively small number of replicates is available.
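For reference, the empirical fold-change criterion that the abstract criticises can be sketched as below; this is the baseline, not the proposed method, whose details are not given. The comparison of each time point with the first sample, the function names and the toy profiles are assumptions.

```python
from typing import Dict, List


def fold_change(baseline: float, value: float) -> float:
    """Symmetric fold change between two positive expression values."""
    return max(value / baseline, baseline / value)


def select_by_fc(profiles: Dict[str, List[float]], threshold: float = 2.0) -> List[str]:
    """Genes whose expression changes by at least `threshold` vs the first time point."""
    selected = []
    for gene, series in profiles.items():
        baseline = series[0]
        if any(fold_change(baseline, v) >= threshold for v in series[1:]):
            selected.append(gene)
    return sorted(selected)


# Toy time courses (arbitrary intensity units), not real data.
PROFILES = {
    "geneA": [100.0, 120.0, 250.0],   # high intensity, 2.5-fold change
    "geneB": [100.0, 110.0, 90.0],    # high intensity, small change
    "geneC": [20.0, 45.0, 30.0],      # low intensity, 2.25-fold change
}
SELECTED = select_by_fc(PROFILES)
```

The constant cut-off treats `geneA` and `geneC` identically even though they are measured at very different intensities, which is exactly the limitation raised above for a technology whose measurement error depends on intensity.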

28. Barberis M, De Gioia L, Ruzzene M, Sarno S, Coccetti P, Pinna LA, Vanoni M, Alberghina L
The Cyclin-Dependent Kinase Inhibitor Sic1 of Saccharomyces cerevisiae Is a Functional and Structural Homologous to the Mammalian p27Kip1
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: In budding yeast, Sic1, an inhibitor of cyclin-dependent kinase (Cki), blocks the activity of the Cdk1-Clb5/6 (S-Cdk1) kinase required for the initiation of DNA replication, which takes place only when Sic1 is removed. Deletion of Sic1 causes premature DNA replication from fewer origins, extension of the S-phase and inefficient separation of sister chromatids during anaphase, whereas delaying S-Cdk1 activation rescues both S and M phase defects. Despite the well-documented relevance of Sic1 inhibition of S-Cdk1 for cell cycle control and genome instability, the mechanism by which Sic1 inhibits S-Cdk1 activity remains obscure. Sic1 has been proposed to be a functional homologue of the mammalian Cki p21Cip1, which is characterized by a significant sequence similarity with the Cki p27Kip1, an inhibitor of the Cdk2/Cyclin A kinase activity during S-phase.

29. Marabotti A, D'Auria S, Rossi M, Facchiano AM
Modelling the Three-Dimensional Structure of a Sugar Binding Protein from a Thermophilic Organism: Analysis on Stability and Sugar Binding Simulations.
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: The characterization of proteins from thermophilic organisms is of growing interest for possible biotechnological applications. Recently, the complete genome of a hyper-thermophilic archaebacterium, P. horikoshii, was sequenced [1] and a sugar binding protein (Ph-SBP) was identified by analysis of its sequence similarity. Some preliminary experimental information is available on its binding properties and on its structural features; however, the lack of information about its 3D structure prevents a complete understanding of its conformational properties and of its interactions with its ligands. Here, we present the results of the homology modelling strategy we used to predict the 3D structure of Ph-SBP, and the analysis we performed on the resulting model in order to assess its reliability, with particular attention to its expected thermostability features and sugar binding properties.

30. Sboner A, Barbareschi M, Dell'Anna R, Demichelis F
Large scale TMA experiments: automation and data management
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Characterization of gene-expression profiles with DNA microarrays provides a powerful means to discover disease-related genes, particularly in cancer. It is well known that clinical validation of disease-related genes through standard molecular analysis on individual tissue sections requires an enormous effort in terms of time and costs. To overcome this problem, the Tissue Microarray (TMA) methodology has recently been developed: a high-throughput technology enabling “genome-scale” molecular pathology studies. In this paper we briefly present our technological platform, designed and optimized for the complete management of Tissue Microarray experiments. Our comprehensive system is very flexible regarding the management of data and allows a wide range of microarray experiments on different diseases. We also obtained promising new results on biomarker expression in ovarian and breast cancer, in terms of discrimination of patients’ overall survival and relapse-free survival.

31. Eleuteri A, Tagliaferri R, Acernese F, Milano L, De Laurentiis M
Information Geometry for Survival Analysis and Feature Selection
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: In this paper an information geometric approach to survival analysis is described. A neural network is designed to model the probability of failure of a system, and it is trained by minimising a suitable divergence functional in a Bayesian framework. By using the trained network, minimisation of the same divergence functional allows for fast, efficient and exact feature selection.

32. Cozzini P, Fornabaio M, Mozzarelli A, Spyrakis F, Kellogg GE, Abraham DJ
HIV-1 protease: a good system to evaluate protein-ligand interactions, water role and protonation state, using an empirical approach
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: A set of 23 protein-ligand complexes of HIV-1 protease and inhibitors has been used as a validation test of an empirical approach, to study protein ligand interactions, considering the role of water molecules involved and the protonation state of protein and ligand ionizable groups. It is well known that the protein ligand binding process is a concerted sum of single events, so many aspects have to be considered. We have demonstrated that an empirical approach based on experimental LogP values and structural information could be used to design new ligands and to understand biomolecular association from several points of view. It is also known that water presence and behaviour can affect the binding. Furthermore, modeling the exact protonation state of several ionizable groups leads to a more realistic in silico model design. HIV-1 protease-ligand complexes represent a good system to experiment the empirical approach of the HINT scoring software, because of the good resolution of crystal data, the well known behaviour of the most important water molecule, WAT301, the presence of a set of water molecules in the cavity surrounding the ligands and, moreover, because a more exact treatment of protein and inhibitor ionizable groups could affect the correctness of the models. We have first analysed the role played by five water molecules placed into the active site and well determined both by X-ray crystallographic analyses and GRID simulations. In addition, we have considered the contributions of another twelve waters surrounding the binding cavity. The different values of the HINT scores, calculated for ligand-water and protein-water interactions, could thus be used to define a water importance scale and to understand the role played by each molecule in the binding stabilisation. 
We have pointed out, in agreement with data reported in the literature, the significance of water 301, whose presence is necessary for complex formation, and the lesser relevance of waters 313, 313’, 313bis and 313bis’, which do not really affect the binding process but contribute to defining the cavity shape. Finally, analyses of the environment surrounding the external ligand extremities, performed for a single HIV-1 protease-inhibitor complex, confirmed our supposition that protein and ligand solvation waters can interact strongly with one of the two entities or with both, but are nevertheless not essential for the binding process. In addition, the exact setting of the protonation state was analysed on a protein-ligand complex (pdb code 1A30) for which experimental Ki values at different pH values have been determined.

33. Marra D, Malusa F, Piersigilli F, Manniello MA, Romano P
The CABRI website: integrating biological resources information in the bioinformatics network environment
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Biological resources are essential tools in modern biomedical research. It is therefore essential that information on quality biological resources is well known in the scientific community. Web sites distributing this information are more and more widely available, but access to and retrieval of the information through a single system is highly desirable. The CABRI (Common Access to Biological Resources and Information) project was funded by the European Union (EU) from 1996 to 1999. It aimed at setting up a “one-stop-shop” for biological materials and related information. This project led to the setting up of the CABRI web site (http://www.cabri.org/), where the catalogues of participating culture collections can be queried, either individually or collectively, and the Guidelines for the Collection Quality Management that were adopted by partners can be examined. It includes information on more than 120,000 items from 28 collections, including bacteria, filamentous fungi and yeast strains, human and animal cell lines, plasmids, phages, DNA probes, plant cells and plant viruses from nine centers (BCCM, CABI, CBS, CIP, DSMZ, ECACC, ICLC, NCCB, NCIMB). This wealth of information has been made searchable through an implementation of SRS (Sequence Retrieval Software). In 2001, a new project was launched, the European Biological Resource Centers Network (EBRCN). This project has been funded by the EU for the period 2001 - 2004. Among its objectives is the extension of the CABRI on-line services, with special emphasis on achieving better integration with molecular biology and literature databanks (see http://www.ebrcn.org/).

34. Toppo S, Fontana P, Velasco R, Valle G, Tosatto SCE
FOX (FOld eXtractor): A novel protein fold recognition method using iterative PSI-BLAST searches and structural alignments
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: We present a novel fold recognition method based on the combination of detailed sequence searches and structural information. Presently the protocol implements two different approaches to assign the correct fold to the target protein sequence: the first is based on database secondary structure search and the second is based on iterative database sequence search. In the first phase a secondary structure prediction of the target is performed and based on the ConSSPred protocol. This prediction is used to search for hits against a database of known secondary structures extracted from PDB (using DSSP). The search is based on a two-step strategy: the first step is based on a Smith-Waterman local secondary structure similarity search with a specific substitution matrix optimized for secondary structure alignment. The second is based on a global alignment based on SSEA (Secondary Structure Element Alignment), as implemented in our program MANIFOLD, to refine the score and the alignment itself in the region extracted from the first step. At the end of the first phase a list of hits that share a similar secondary structure topology with the target sequence is extracted. The second phase is based on a modified protocol for scanning the sequence database called SENSER. In the beginning of the second phase, BLASTP is used to scan the target sequence against the NR database. These initial hits are clustered to reduce sequence bias and a seed alignment with 20 or fewer sequences generated. This step ensures that PSI-BLAST can be jump-started with a more sensitive initial profile, increasing its sequence diversity. PSIBLAST is run for four iterations (e-value inclusion threshold 10e-3) on the NR60 database of known sequences. NR60 is produced by applying the CD-HIT algorithm to cluster the NR database at 60% sequence identity. 
Sequences producing NR60 hits with the query are assigned either to the significant sequence space (e-value <= 10e-3) or the trailing end (e-value <= 10) for further use. The profile is used to search the PDBAA database of sequences with known structure. If a significant PDBAA hit (e-value <= 10) is found, the protocol proceeds to the back-validation step (see below). If no significant hit is found, or the hit does not back-validate, a new PSI-BLAST search, using the above "4+1" protocol on NR and PDBAA, is started for the highest ranking sequence (i.e. lowest e-value) in the significant sequence space. Sequences from NR60 matching the query are also assigned to either the significant sequence space or the trailing end. Significant PDBAA hits are again submitted to back-validation. If no significant PDBAA hit is recorded and the significant sequence space has been exhausted, then the protocol uses the trailing end sequences as additional starting points for PSI-BLAST searches. In contrast to previous sequences, which were assumed to be similar enough to the target to imply homology, these sequences are submitted to back-validation before proceeding to the "4+1" PSI-BLAST protocol. The back-validation step consists of using PSI-BLAST to find the target starting from a different query sequence, found as described above. Because of the asymmetric nature of PSI-BLAST, if sequence A finds sequence B, it is not always the case that B also finds A. Sequences that back-validate are more likely to be correct hits. Once a sequence from PDBAA back-validates and its secondary structure is compatible with that of the target sequence as found in the first phase, the protocol builds a target-to-template alignment and stops. The procedure described so far serves to identify a template structure for the target sequence. In order to produce an accurate alignment, HMMER is used to build a hidden Markov model (HMM) based on the HOMSTRAD sequence alignment.
The target is then aligned to the template using this HMM. Preliminary results for the method indicate a clear increase in both detection rate and alignment accuracy for distantly homologous sequences. Presently FOX has been tested on Fischer-68 test set to compare its performance with standard PSI-BLAST searches, GenTHREADER and the original SENSER protocol. As expected the introduction of the secondary structure prediction of the protein target and the database secondary structure searches in the first phase have increased detection sensitivity and sensibility of the method compared to profile based searches as PSI-BLAST and SENSER protocol (Fig. 1). The performance is comparable to GenTHREADER showing that right template structure is always found in the top 50 hits as shown in Fig. 1. Further score optimization and development are required to definitely test the entire protocol and make the program available as a web-based server from our group's web site (http://protein.cribi.unipd.it/).
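
The bookkeeping described above (assigning hits to the significant sequence space or the trailing end, and back-validating a candidate) can be illustrated with a minimal Python sketch. It is not the FOX implementation. The function names, the `search` callable that stands in for a PSI-BLAST run, and the back-validation cutoff are assumptions made for illustration, and the threshold "10e-3" is read literally as 0.01.

```python
from typing import Callable, Dict, List, Tuple

# Thresholds quoted in the abstract. "10e-3" is read literally (0.01);
# the authors may have intended 1e-3.
SIGNIFICANT_MAX = 10e-3
TRAILING_MAX = 10.0


def partition_hits(
    hits: Dict[str, float],
    significant_max: float = SIGNIFICANT_MAX,
    trailing_max: float = TRAILING_MAX,
) -> Tuple[List[str], List[str]]:
    """Split hits (sequence id -> e-value) into the significant sequence
    space and the trailing end, each ordered by increasing e-value."""
    significant = sorted(
        (s for s, e in hits.items() if e <= significant_max), key=hits.get
    )
    trailing = sorted(
        (s for s, e in hits.items() if significant_max < e <= trailing_max),
        key=hits.get,
    )
    return significant, trailing


def back_validates(
    target: str,
    candidate: str,
    search: Callable[[str], Dict[str, float]],
    max_evalue: float = TRAILING_MAX,  # hypothetical cutoff, not from the abstract
) -> bool:
    """Return True if a search started from `candidate` finds `target`.
    `search` is a hypothetical stand-in for a PSI-BLAST run that returns
    a mapping of hit id -> e-value."""
    return search(candidate).get(target, float("inf")) <= max_evalue
```

With a toy lookup table in place of a real search, for instance one in which A finds B but B finds nothing, `back_validates("A", "B", ...)` is false while `back_validates("B", "A", ...)` is true, which is the asymmetry the back-validation step is meant to catch.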

35. Trovato A, Seno F
A new perspective on Analysis of Helix-Helix Packing Preferences in Globular Proteins
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: For many years it had been believed that steric compatibility of helix interfaces could be the source of the observed preference for particular angles between neighbouring helices, as emerging from statistical analysis of protein databanks. Several elegant models describing how side chains on helices can interdigitate without steric clashes were able to account quite reasonably for the observed distributions. However, it was later recognized that the “bare” measured angle distribution should be corrected to avoid statistical bias. Disappointingly, the rescaled distributions dramatically lost their similarity with theoretical predictions, casting many doubts on the validity of the geometrical assumptions and models. In this report we elucidate a few points concerning the proper choice of the random reference distribution. In particular we show the existence of crucial corrections induced by unavoidable uncertainties in determining whether two helices are in face-to-face contact or not and their relative orientations. By using this new rescaling, we show that “true” packing angle preferences are well described by regular packing models, thus proving that preferential angles between contacting helices do actually exist.

36. Papaleo E, Vai M, Popolo L, Fantucci P, De Gioia L
Structural models of the catalytic domain of the yeast β-(1,3)-glucan transferase Gas1 by combined threading and secondary structure prediction methods
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: Gas1p is an exocellular glycoprotein of Saccharomyces cerevisiae and plays a crucial role in cell wall assembly, due to its β-(1,3)-glucan transferase activity. The identification of Gas1p homologues in other yeast species and fungi allowed the definition of a new family of glycosyl hydrolases, family GH72, on the basis of sequence similarity. Hydrophobic cluster analysis of the catalytic domain (C-domain) of some GH72 members suggests a (β/α)8 barrel fold, also supported by our recent study on the structural and functional characteristics of the C-domain of Gas1p. Standard homology modelling approaches cannot be used to infer the structure of the C-domain of Gas1p and related proteins, due to the lack of suitable homologues of known 3D structure. Threading and fold recognition approaches have been shown to predict the fold of novel proteins with relatively high accuracy. However, it should be noted that the detection of possible remote homologues is only the first step of successful modelling. In fact, alignments to the same scaffold produced by different threading methods can be significantly dissimilar and affected by local errors, making the derivation of a good structural model difficult. With the aim of unraveling the key molecular characteristics of the C-domain of Gas1p and related proteins, in the present work a procedure has been worked out in which data derived from threading methods, multiple sequence alignments and secondary structure predictions were merged and compared to experimental results in order to obtain reliable and detailed three-dimensional models.

37. Pasa S, Kohn KW, Aladjem MI, Consiglieri C, Cocozza S, Bordo D, Parodi S
In silico model of Molecular Interaction Maps: c-Myc and cell cycle control
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: Cell behaviour is largely determined by protein:protein interactions. In particular, it has become increasingly evident that cell cycle control, differentiation and death are governed by networks of molecular interactions involving both proteins and DNA. The concomitant rapid increase of data concerning gene expression as measured in large scale experiments, has made evident the need to represent biochemical effectors (proteins and DNA) and their mutual interaction in an integrated way, in the form of a Molecular Interaction Map (MIM). To describe MIMs in a coherent graphical notation, the use of “wiring diagrams” similar to those adopted in electronics is proposed. In this work we describe the main features of a MIM focused on the oncogene c-Myc and on its role in cell cycle control.

38. Santarossa G, Roggia L, De Gioia L, Fantucci P
A Molecular Dynamics Study of the Double Dominant Negative Mutation W809E/T935E in Ras-GEF Complex
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: Ras proteins are guanine nucleotide binding enzymes, with intrinsic low GTPase activity, involved in the control of cell growth and cell differentiation. They act as molecular switches, cycling between an active GTP-bound state and an inactive GDP-bound state. The Ras activation state is regulated by the competing activity of GTPase activating proteins (GAPs) and guanine nucleotide exchange factors (GEFs), the latter promoting the activation of Ras by catalysing the exchange of GDP with GTP. In most tumors the activity of Ras proteins is altered, resulting in hyperactive GTP-bound forms of Ras, either because of a reduced GTPase activity or because of an increased GDP/GTP exchange. The GEF mutant W809E/T935E (GEFmut) results in a dominant negative GEF, catalytically inactive, which binds to Ras with great affinity and forms a stable complex in the presence of excess nucleotide. By means of Molecular Dynamics (MD) simulations we compared different trajectories of Ras-GEFwt and Ras-GEFmut systems and analyzed them in terms of both energetic and structural parameters, to correlate the conformational differences of wt and mutant GEFs during their interaction with Ras with the observed modifications in Ras biological activity.

39. Roasio R, Fu L-M, Botta M, Medico E
MulCom: a novel program for the statistical analysis of genomic data obtained on multiple microarray platforms
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: The increasing pace at which DNA microarray-based genomic expression profiles are generated and published poses the issue of efficient and reliable comparison between datasets obtained by different laboratories and on different microarray platforms. Statistical analysis of microarray data is in continuous evolution, and several procedures have been described for detecting and weighing the systematic and random errors arising from highly parallel, but poorly replicated, microarray expression data. However, data obtained from different microarray platforms may be of a substantially different nature. This is particularly evident when comparing two commonly used platforms, spotted cDNA microarrays and High-Density Oligonucleotide (HDO) microarrays of the Affymetrix type. cDNA microarrays yield a reproducible ratio between two signals, deriving respectively from the reference and from the sample. Conversely, absolute signals tend to vary across microarrays. Therefore, cDNA microarray data have to be analyzed with statistics handling repeated measurements or paired data, such as the paired t-test. In the case of HDO microarrays, an absolute signal level is obtained from each single mRNA sample. As a consequence, non-paired statistics have to be applied to this type of data. Given the intrinsic differences between cDNA and HDO microarrays, data analysis procedures have generally been developed on one of the two platforms and only in some cases adapted to the other, without a specific focus on systematic comparison and validation across platforms. It is still unclear whether data obtained in the two systems can be treated, compared and eventually merged under a common analysis framework. We addressed these issues by generating expression profiles from the same RNAs with both microarray platforms and by developing an analysis procedure in which inter-platform differences in data treatment are reduced to the minimum essential. 
We then developed a novel statistical test specifically designed to handle multiple comparisons against the same reference condition (e.g. many points of stimulation against one unstimulated control). In the Multiple Comparison (MulCom) test, regulated genes are identified by a ‘tunable’ statistical test weighing the expression change at each stimulation point against replicate variability calculated across the whole set of stimulation points.
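
The contrast drawn above between paired and non-paired statistics can be made concrete with a small Python sketch. It does not implement the MulCom statistic, which the abstract does not specify; it only computes a textbook paired t statistic and an unpaired (Welch) t statistic on invented log2 signals, where the absolute level varies strongly from array to array but the sample-to-reference difference is stable. The function names and the numbers are illustrative assumptions.

```python
import math
import statistics
from typing import Sequence


def paired_t(reference: Sequence[float], sample: Sequence[float]) -> float:
    """Paired t statistic: mean of the per-array differences divided by
    its standard error (suited to two-channel, ratio-based data)."""
    diffs = [s - r for r, s in zip(reference, sample)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def welch_t(reference: Sequence[float], sample: Sequence[float]) -> float:
    """Unpaired (Welch) t statistic: difference of the group means divided
    by the combined standard error (suited to single-channel data)."""
    se = math.sqrt(
        statistics.variance(reference) / len(reference)
        + statistics.variance(sample) / len(sample)
    )
    return (statistics.mean(sample) - statistics.mean(reference)) / se


# Invented log2 signals for one gene on three arrays.
reference = [8.0, 10.0, 12.0]
sample = [9.0, 11.1, 12.9]
print(paired_t(reference, sample), welch_t(reference, sample))
```

On these toy values the paired statistic is far larger than the unpaired one, because pairing removes the array-to-array variation in absolute signal.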

40. Fogolari F, Tosatto SCE
Loop predictions using molecular mechanics/Poisson-Boltzmann solvent accessible surface area (MM/PBSA)
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: In many predictive tasks accurate free energy estimation is needed. The molecular mechanics/Poisson-Boltzmann solvent accessible surface area (MM/PBSA) approach has proven to be one of the most accurate. However, the correlation between the estimated free energy and the distance (e.g. root mean square deviation (RMSD)) from the most stable conformation is hindered by the strong free energy dependence on minor conformational variations. In the present paper a protocol for MM/PBSA free energy estimation is designed and tested successfully on several loop decoy sets. Further integration of the MM/PBSA free energy estimator with the "colony energy" approach makes the correlation between free energy and RMSD from the native structure apparent, thus making the method both accurate and robust.

41. Staiano A, Tagliaferri R, De Vinco L, Longo G
Advanced Data Mining Methodology Based on Latent Variable Models
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: The aim of this paper is to present a powerful tool for data mining activities based on a nonlinear latent variable model, i.e. Probabilistic Principal Surfaces (PPS). PPS builds a probability density function of a given data set of patterns, lying in a D-dimensional space, which can be expressed in terms of a limited number of latent variables lying in a Q-dimensional space. Usually, Q is 2 or 3 and thus the density function is used to visualize the data in the latent space. PPS has been fruitfully exploited for classification as well as visualization and clustering of complex real high-D data and represents a promising data mining tool for researchers in genetics and bioinformatics.

42. Antoniol G, Ceccarelli M
A Computational Intelligence Approach to Unsupervised Microarray Image Gridding
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: Image analysis is an essential aspect of microarray experiments: measurements over the scanned image can substantially affect subsequent steps such as clustering and identification of differentially expressed genes. Scanned microarray image processing has three main tasks: (i) gridding, which is the process of assigning coordinates to the spots, (ii) segmentation, which separates foreground from background pixels, and (iii) intensity extraction. Most available gridding approaches require human intervention, for example to specify some points in the spot grid or even to register individual spots. Automating this part of the process will allow high-throughput analysis. The paper reports a novel approach to the problem of automatic gridding in microarray images. The method uses a two-step process. First, a regular rectangular grid is superimposed on the image by interpolating a set of guide spots; this is done by solving a non-linear optimization problem with an evolutionary approach. Second, the interpolating grid is adapted to local deformations with a Markov Chain Monte Carlo method. This is done by modeling the solution as a Markov Random Field with a Gibbs prior possibly containing first order cliques (1-clique). The algorithm is completely automatic, requires no human intervention, and efficiently accounts for grid rotations and irregularities.

43. Rossi V, Picco R, Vacca M, D'Urso M, D'Esposito M, Galli T, Filippini F
Novel sequence patterns specific to VAMP subfamilies
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: In eukaryotic cells, SNARE proteins of the vesicle or target membrane (v- or t-SNAREs) play a central role in the control of membrane fusion and protein and lipid traffic. SNAREs’ coiled-coil domains (CCDs) have probably evolved from a common ancestor with a hydrophobic heptad register, interrupted by a conserved polar residue at the ionic “zero” layer. Depending on the nature of this residue, SNAREs have been reclassified as either Q- or R-SNAREs. R-SNAREs consist of two subfamilies: (i) short VAMPs or brevins (from the Latin word “brevis” = short), and (ii) long VAMPs or longins, sharing a conserved N-terminal Longin Domain. Distinct amino acid patches are likely to determine the specificity of SNARE pairing by reducing structural integrity when mismatched SNAREs interact. When considering pairing of the Q- and R-SNARE CCDs, an asymmetric “complementarity” is found in layers -3, -2, and +6, where bulky side chains are packed together with smaller ones, possibly enforcing the correct register between the CCDs of the fusion complex. Sequence variation in the SNARE domains, by altering local charges at the interaction layers, is likely to mediate a fine modulation of the interaction specificity and/or kinetics, regulating intramolecular binding as well as binding to a growing family of SNARE-interacting factors. Although the structure of the SNARE complex is evolutionarily conserved, biological specificity is probably mediated mainly by accessory proteins recognizing CCD surface patterns of charges and of polar and nonpolar side chains that differ between the endosomal and neuronal complexes. Recently, it has been reported that the interaction between acidic surface residues of the SNAREs and basic residues on the concave surface of α-SNAP is crucial to the disassembly of the complex.

44. Gaiji N, Mazzitello R, Beringhelli T, Fantucci P
Bovine β-lactoglobulin: Interaction studies with Norfloxacin
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: Molecular docking is an efficient computational tool to predict the structures of protein-ligand complexes. This kind of simulation is of fundamental importance for the interpretation of numerous biochemical phenomena, providing useful information on the preferred binding sites of ligands, and therefore in rational drug design. Bovine β-lactoglobulin (BLG) is a small extracellular protein belonging to the lipocalin superfamily. Lipocalins have been classified as transport proteins with the remarkable ability of binding small hydrophobic molecules within the central cavity, also known as the calyx. Because of its stability, abundance and ease of preparation, BLG has been frequently studied to clarify its structural and binding features. Several studies suggest that more than one binding site exists; thus the aim of this work is to investigate the existence of other sites, in addition to the calyx, and to verify whether BLG can interact with drugs and act as their carrier. We considered the particular case of Norfloxacin, which is a broad-spectrum antibiotic used in the treatment of urinary tract infections.

45. Manzoni R, Sacco E, De Gioia L, Vanoni M
Hydrophobic network between AB and HI hairpins suggests a new role for AB hairpin in GEF action mechanism
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: The analysis of a protein's 3D structure is an important step towards understanding its mechanism of action, regulation, function and family membership. Experimental methods for protein structure determination do not keep up with the increasing number of genomic sequences available: this has led to an increase in computational methods that predict a three-dimensional model for a protein of unknown structure (target) on the basis of sequence similarity to proteins of known structure (templates). There are different kinds of homology modelling methods, but none of them can recover from an incorrect target-template alignment: a good alignment is the first thing to be considered when assessing a model's confidence. SWISS-MODEL, an automated comparative protein modelling server, starts with the analysis of the structurally conserved regions in the target-template alignment. Ras proteins are highly conserved GTPases playing a pivotal role in different important cellular events: cell proliferation, differentiation, cellular traffic and cytoskeleton organization. Within cells, Ras proteins exist either in a GTP-bound form (“on” state) or in a GDP-bound form (“off” state). The level of the GTP-bound state derives from the balance of the activity of the GTPase Activating Proteins (GAPs) and Guanine nucleotide Exchange Factors (GEFs). A common feature of all Ras GEFs is the presence of a domain, the RasGEF domain, carrying all the main structural features needed to interact with Ras and to exchange the nucleotide. A notable feature of this catalytic domain is the protrusion of a hairpin, formed by helices αH and αI, out of the core of the domain. It has been proposed that helix αH plays an important role in the nucleotide-exchange mechanism by opening up the nucleotide-binding site.

46. Mapelli V, Accardo E, Fantinato S, Sacco E, De Gioia L, Vanoni M
Structure-based hypothesis on active role of RasGEF αG-helix
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: Ras proteins are small GTPases involved in signaling pathways controlling cell growth and differentiation. They act as molecular switches by cycling between an active GTP-bound and an inactive GDP-bound state. Following the activation of specific cell-surface receptors, Ras proteins switch from the inactive to the active state through the catalytic action of specific Guanine nucleotide Exchange Factors (GEFs), which promote the dissociation of GDP from Ras, allowing GTP to enter the Ras nucleotide pocket. The Saccharomyces cerevisiae Ras-GEF Cdc25 (Cdc25Sc) was the first Ras-exchanger to be identified. In higher eukaryotes there are two different classes of Ras-specific Cdc25Sc homologs, Sos proteins and Cdc25Mm, also referred to as Ras GRF. Ras-specific GEFs are made of several functional and structural domains; Ras GEF activity is contained within a domain showing very high similarity to the Cdc25Sc catalytic domain and called, for this reason, the Cdc25 homology domain. Structural studies on Ras crystallized in complex with nucleotide (GDP or GTP analogs) and with the human exchange factor Sos, respectively, have made it possible both to identify conformational differences between the active and inactive states of Ras and to formulate hypotheses on the molecular determinants of interaction and catalytic activity of human Sos. Mutational and structural studies on the Ras GEF catalytic domain have pointed to a major role for the helical hairpin formed by the αH and αI helices (catalytic hairpin) in the catalytic mechanism of Ras-specific GEFs. In the present work we investigate the role of the Ras GEF αG-helix in Ras GDP-to-GTP exchange.

47. D'Ursi P, Rovida E, Merati G, Biguzzi E, Caprera A, Milanesi L, Faioni E
Computational analysis of naturally occurring protein C mutants: electrostatic properties implications.
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: Activated Protein C (APC) is a vitamin K-dependent anticoagulant plasma serine protease that exerts its action through the inactivation of factors Va and VIIIa in the presence of Ca++ and phospholipids. Deficiency of protein C is associated with the risk of developing venous thrombosis. APC shares homologies with other vitamin K-dependent coagulation proteins as a result of a common evolutionary pathway. The chymotrypsin-like serine proteases maintain a strictly conserved active site geometry among their catalytic Ser, His and Asp residues. The fact that this core is highly conserved both in sequence and structure among members of the serine protease family suggests that its shape has been finely tuned during evolution. 33 mutations (18 novel) in the promoter and coding regions of the PC gene were identified by PCR and sequencing in 46 patients reporting venous thromboembolic events. Here we present a computational analysis of three selected mutants (G43E, D194N, G216D) that are localized in the catalytic domain and determine a charge modification in the vicinity of the catalytic triad.

48. Mutarelli M, Basile W, Cicatiello L, Scafoglio C, Colonna G, Weisz A, Facchiano AM
Comparative analysis with three different microarray platforms of the oestrogen-responsive transcriptome from breast cancer cells
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: The DNA microarray technique makes it possible to analyze the expression patterns of tens of thousands of genes in a short time. The wide use of this technique and the rapid improvement of the different technologies available from several commercial and academic providers have led to the publication of thousands of results, extremely heterogeneous with respect to the type of technology used, the kind of normalization and analysis subsequently applied to the data, and so on. This makes it difficult for groups with common research interests to collaborate and exchange data, whereas collaborations would be extremely useful given the high cost of these techniques and considering that a carefully designed experiment could bring results relevant to different groups, each focusing on a different aspect of a main biological problem. Awareness of the need for common standards or, at least, comparable technologies is therefore emerging in the scientific community, as shown by the effort of the Microarray Gene Expression Data (MGED) Society, which is trying to set up at least experimental methodology, ontology and data format standards. In addition, it is important to be able to compare newly produced data with preceding experiments, so as to preserve the value of results produced with equipment of the older generation. Otherwise, a large amount of the work produced before each new release of technology would be lost. Considering that the huge amount of data produced is largely underexploited, this would be a great loss for the scientific community. In fact, as analysis algorithms improve, existing data can be re-analyzed to give more precise results, thus helping to adjust the planning of future experiments. 
We thus started this work with the aim of evaluating the technical variability between three commonly used microarray platforms, so as to adapt the first part of the analysis to the peculiarities of each technique, and of assessing the feasibility of a common subsequent analysis path, thus taking advantage of the different data-extraction abilities of the three. For this purpose, we used three different commercial chips to study the gene expression profiles of hormone-responsive breast cancer cells with and without stimulation with estradiol: i) the Incyte ‘UniGEM V 2.0’ microarrays, containing over 14,000 PCR-amplified cDNAs, corresponding to 8286 unique genes, spotted at a high density pattern onto glass slides; ii) the Affymetrix technology, based on 25 nucleotide-long oligonucleotides directly synthesized on a GeneChip® array, representing more than 39,000 transcripts derived from approximately 33,000 unique human genes; iii) the Agilent ‘Human 1A Oligo’ Microarray consisting of 60-mer, in situ synthesized oligonucleotide probes for a total of about 18,000 different genes. The RNA was derived from human breast cancer cells (ZR-75.1) stimulated for 72 hrs with 17beta-estradiol after starvation in steroid-free medium for 4 days; the reference sample was derived from synchronized cells grown in a steroid-free environment. The same samples were used to generate fluorescent targets to be hybridized on the different slides. Hybridization reactions were performed with four (for the Agilent slides) and two or three (for the other platforms) technical replicates, with a single (Incyte) or double (Agilent), balanced dye swap for competitive hybridizations. A total combined number of 18,823 unique UniGene clusters were represented among the three platforms used. Focusing only on the subset of 5,733 genes that were present in all the chips, about 50% appeared to be significantly expressed and 25% appeared to be significantly regulated by 17beta-estradiol treatment in our experiment. 
A rather low overlap was observed between the lists of regulated genes obtained by the three systems. We are working on understanding the conflicting results on some of the genes. The majority of genes were detected only by the Affymetrix platform, probably as a consequence of the higher sensitivity of this system, which allows the detection of some gene expression levels that are not identified with the other platforms. However, a number of genes were identified only by the cDNA and/or oligonucleotide systems. Another possible experimental explanation is that the DNA sequences spotted on the arrays show different affinity for the target, so each slide has a particular pattern of probe-target annealing, although the same genes are represented on all the platforms. Finally, we are improving the statistical processing of the data in order to allow a better understanding of the experimental results.

49. Catalano D, Licciulli F, Grillo G, Liuni S, Pesole G, Saccone C, D'Elia D
MitoNuc: a database of nuclear genes encoding for mitochondrial proteins
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: Mitochondria are sub-cellular organelles, present in the majority of eukaryotic organisms, which play a central role in the energy metabolism of cells. They are also involved in many other cellular processes such as apoptosis and aging, and in a number of different human diseases, including Parkinson’s, diabetes mellitus and Alzheimer’s. Despite their importance in maintaining cell life, about 95% of the proteins contributing to mitochondrial biogenesis and functional activities are nuclear encoded, synthesized in the cytosol and targeted to mitochondria. The expression and assembly of these proteins are strictly dependent on the coordinated activities of the two genomes, mitochondrial and nuclear, but the molecular mechanisms and co-evolutionary processes of the cross-talk between these two genomes are still largely unknown. MitoNuc is a specialized database of nuclear encoded mitochondrial proteins in Metazoa. It provides comprehensive data on genes and proteins, consolidating information from external databases. These data include: gene sequence, structure and information from ENSEMBL, protein sequence and information from SWISSPROT, transcript sequence and structure from RefSeq and UTRdb, and disease information from OMIM. Each database entry consists of a nuclear gene coding for a mitochondrial protein in a given species, and reports information on: species name and taxonomic classification; gene name, functional product, sub-cellular mitochondrial localization, protein tissue specificity, Enzyme Classification (EC) code for enzymes and disease data related to protein dysfunction. For each gene and gene product the Gene Ontology (GO) classification with regard to molecular function, biological process and cellular component is also reported. Links to external database resources are also provided. As far as the gene and transcript sequence data are concerned, in the previous MitoNuc releases they were extracted from the related EMBL entries. 
Due to the high level of sequence redundancy in the primary database, the majority of MitoNuc entries contained more than one transcript and coding gene sequence for the same gene, thus introducing a remarkable level of redundancy that reduces the effectiveness of the database for sequence analysis. In order to remove redundancy we generated a MitoNuc section of gene and transcript sequences derived from those organisms whose genome sequence draft has been completed and annotated in ENSEMBL. These MitoNuc entries are available in the database section called “MitoNuc Genomics” that, at present, includes the following species: Homo sapiens, Rattus norvegicus and Mus musculus. MitoNuc can be queried using the SRS Retrieval System (http://www.ba.itb.cnr.it/srs/); the present release contains a total of 1344 entries, of which 662 are collected in the MitoNuc Genomics section. The total number of species included in MitoNuc is about 64.

50. Lazzari B, Milanesi L, Stella A, Caprera A, Bianchi F, Vecchietti A, Pozzi C
ESTree DB: a Tool for Peach Functional Genomics
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: A collection of about 8000 Expressed Sequence Tags (EST) sequences has been prepared starting from clones belonging to four cDNA peach libraries. Libraries have been prepared from Prunus persica mesocarps at four different developmental stages with the aim to collect data for deep investigation of the maturation process at the molecular level. A fully automated pipeline (ESTree DB) has been prepared to process EST sequences using public software integrated by in-house developed Perl scripts and data have been collected in a MySQL database called ESTree available at this URL: http://www.itb.cnr.it/ESTree. These data are produced in the frame of the activities of the National Consortium for Peach Genomics (ESTree), involving also the Universities of Padova, Udine and other research Institutions.

51. Bonizzoni P, Dondi R, Rizzi R, Pesole G
ASPIC: a Novel Method to Predict Alternative Splicing
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: In this paper a new method for detecting splicing sites is proposed. It is based on a combined analysis of all available transcript data in order to produce all transcript alignments to the genomic sequence. The algorithm requires that all transcript-genome alignments are fully compatible with a plausible common exon-intron structure within the genomic sequence. The algorithm was implemented in the ASPIC (Alternative Splicing PredICtion) software.

52. Marangoni R
Simulating gene families
Meeting: BITS 2004 - Year: 2004
Full text in a new tab
Topic: Unspecified

Abstract: Once completely sequenced, genomes are clustered into gene families, the members of which share a significant similarity in their sequences (and often in the structures of their protein products) but often play different biological roles. When there is such a relationship between two genes, they are called paralogs. It is generally believed that paralogs arise through an iterated mechanism of gene duplication with subsequent modification of the copies. In a previous work describing a method to reconstruct the history of gene families, a simulator of gene families was introduced in order to bypass the lack of experimental data about gene family history. Working with these simulated data, some interesting features concerning real biological families were found. Nevertheless, they were not explored, since they were too far from the main subject of that paper. In the present work, a simulator similar to that used in the above-cited paper has been developed, and many different synthetic data sets have been generated. The simulation strategy, its biological foundation and the comparison between simulated and real sequences are discussed in detail in the poster.
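
The duplication-and-modification mechanism mentioned above can be illustrated with a toy Python sketch. It is not the simulator described in the abstract, whose strategy is only given in the poster. The function names, the uniform point-substitution model, the fixed mutation rate and the simplification that only the new copy diverges are all assumptions made for illustration.

```python
import random
from typing import List

ALPHABET = "ACGT"


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Apply independent point substitutions: each position is redrawn
    from the alphabet with probability `rate` (the redraw may pick the
    same base again)."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else base for base in seq
    )


def simulate_family(
    ancestor: str, n_duplications: int, rate: float, seed: int = 0
) -> List[str]:
    """Grow a toy gene family: at each step a randomly chosen member is
    duplicated and the new copy is modified; the parent is kept as is
    (a simplifying assumption)."""
    rng = random.Random(seed)
    family = [ancestor]
    for _ in range(n_duplications):
        parent = rng.choice(family)
        family.append(mutate(parent, rate, rng))
    return family


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


family = simulate_family("ACGT" * 10, n_duplications=5, rate=0.1, seed=1)
print([round(identity(family[0], member), 2) for member in family])
```

The printed identities to the ancestor give a rough picture of how simulated paralogs diverge under these assumptions; a realistic simulator would also need insertions, deletions and a model of divergence times.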

53. Carrara GE, Stella A, Pinciroli F, Alcalay M, Masseroli M
Automatic extraction of gene annotations from data-rich HTML pages
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: High-throughput technologies make it necessary to integrate the resulting gene expression data with information mined from large amounts of gene annotations held in several different biomolecular databanks. Most of these databanks can be queried only via the web, for a single gene at a time, and query results are generally available in HTML format. Although some databanks provide batch retrieval of data via FTP, this requires expertise and resources for locally re-implementing the databank. Web wrappers can automate the extraction of information on numerous genes from different web-based databanks. As the content of a dynamic web page can change from one query to another (e.g. tables with extra rows or missing fields), such wrappers should be able to locate and extract the data of interest inside different HTML pages. Unfortunately, HTML tags describe the visual formatting of data, not their semantics. Thus, human-readability and machine-readability are often not equivalent. Wrapper generation tools help create a wrapper for a specific source, i.e. a web-based biomolecular databank with its own HTML layout. First, the user is invited, via a graphical user interface, to select the data of interest inside one or more sample HTML pages. Then, the system saves this information as an extraction template for that specific source. The long-term goal is to generate wrappers that scale well with the number of processed web pages.
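
As a rough illustration of the extraction-template idea, the sketch below uses Python's standard `html.parser` to pull labelled values out of table rows, tolerating extra rows and missing fields. It is not the authors' system: the template format (a mapping from field name to row label), the sample pages and the gene names are hypothetical.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def extract(html, template):
    """Apply a template {field_name: row_label} to one HTML page.

    Rows are located by their label rather than by position, so extra rows
    are ignored and missing fields come back as None.
    """
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    found = {row[0].rstrip(":"): row[1] for row in parser.rows if len(row) >= 2}
    return {field: found.get(label) for field, label in template.items()}

# Hypothetical template and pages, for illustration only.
TEMPLATE = {"symbol": "Gene symbol", "chromosome": "Chromosome", "function": "Function"}
PAGE_A = ("<table><tr><th>Gene symbol:</th><td><b>GENE1</b></td></tr>"
          "<tr><td>Chromosome</td><td>2</td></tr></table>")
PAGE_B = ("<table><tr><td>Note</td><td>extra row</td></tr>"
          "<tr><td>Function</td><td>transcription factor</td></tr>"
          "<tr><td>Gene symbol</td><td>GENE2</td></tr></table>")
record_a = extract(PAGE_A, TEMPLATE)
record_b = extract(PAGE_B, TEMPLATE)
```

Keying on the row label rather than on the row index is one simple way to cope with pages whose layout shifts between queries; a real wrapper would need a richer template than this.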

54. Cappadona S, Diestellhorst L, Kemp G, Cerutti S
Analysis of β-helix proteins using the STACK toolkit
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: β-helix proteins contain a solenoid domain of parallel β-strands folded into a large prism. Each turn of the solenoid, called a β-coil, consists of a succession of a few (usually three) β-strands. β-strands from adjacent coils stack to form parallel β-sheets that make up the faces of the prism. These faces are linked by loop regions that protrude from the helix and, in many cases, form the binding site of the helix. The cross section of this prism differs with handedness: it is typically L-shaped in right-handed parallel β-helices and triangular in left-handed parallel β-helices. The stability of the domain is mainly obtained by the stacking of similar residue side chains at equivalent positions in successive coils, both inside and outside the helix. The inward-pointing side chains are mainly hydrophobic and, when they are not, maximal hydrogen bonding or electrostatic interactions neutralise their polar or charged groups. We have formalised the intuitive notion of a β-helix in a set of objective algorithms that automatically recognize the basic structural elements of β-helices: residue stacks, β-coils, cores and β-helices. We define the core of a β-helix as the helical domain of the protein, as distinguished from the protruding loop regions.

55. Ferraro E, Ausiello G, Panni S, Cesareni G, Helmer-Citterich M
Definition of a neural strategy for the prediction of protein interaction specificity
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: We are developing a neural network strategy for the prediction of peptide recognition specificity by SH3 domains. As a training set we use the results of a large number of SH3-peptide binding experiments obtained by the SPOT synthesis technique (PepSPOT). As input for the neural network, we consider the sequences of both the domain and the hypothetical ligand peptide, in order to infer, for each domain-peptide combination, the likelihood that they form a complex in a binding reaction. The method will be applied to predict the affinity of any peptide for domains of unknown specificity. We analyzed data from PepSPOT experiments for nine SH3 domains, each tested against several hundred peptides, and constructed a dataset in which each data point includes the domain and peptide sequences and a figure in arbitrary BLU units that correlates with binding affinity. In order to translate this information into a format that can be easily handled by a neural network, we focused on three main problems: i) the information coding; ii) the dimension of the input space; iii) the correct identification of the two classes (binding and non-binding). We decided to use the orthogonal representation of the sequences and, in order to reduce the huge dimensionality, we considered only those domain residues at positions that make contact with the ligand peptide. The contact positions are identified from the analysis of the SH3-peptide complexes of known structure and extended to other SH3 domains of known sequence by multiple alignment. For the peptide sequences we restricted our representation to the most significant positions, excluding the two consensus prolines from the input. Finally, we defined the binding class as all the peptides that show a spot intensity higher than 10000 BLU units.
The resulting dataset was strongly unbalanced, which calls for different methodological strategies: standard feed-forward neural networks require balancing of the training set, while kernel methods (support vector machines) can perform classification even on unbalanced sets, provided a suitable non-linear kernel is chosen. We will compare the performance of the neural strategy with that of regular expressions, position weight matrices, position-specific scoring matrices (PSSMs) and the SPOT procedure.
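
The input coding described above can be sketched as follows. The orthogonal (one-hot) representation and the 10000 BLU threshold come from the abstract; the function names, the 20-letter residue ordering and the example positions are assumptions for illustration, not the authors' code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed ordering of the 20 residues
BINDING_THRESHOLD = 10000               # BLU units, as stated in the abstract

def one_hot(sequence):
    """Orthogonal coding: each residue becomes 20 bits with a single 1."""
    vector = []
    for residue in sequence:
        bits = [0] * len(AMINO_ACIDS)
        bits[AMINO_ACIDS.index(residue)] = 1
        vector.extend(bits)
    return vector

def encode_pair(domain, contact_positions, peptide, peptide_positions):
    """Encode only the selected domain and peptide positions (0-based indices)."""
    domain_part = "".join(domain[i] for i in contact_positions)
    peptide_part = "".join(peptide[i] for i in peptide_positions)
    return one_hot(domain_part + peptide_part)

def binding_class(intensity):
    """Return 1 (binding) if the spot intensity exceeds the threshold, else 0."""
    return 1 if intensity > BINDING_THRESHOLD else 0

# Hypothetical sequences and positions, for illustration only.
x = encode_pair("ACDEFG", [0, 2], "PPLPKR", [2, 4, 5])
y = binding_class(12500)
```

Restricting the domain to its contact positions keeps the input at 20 bits per selected residue, which is the dimensionality reduction the abstract refers to.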

56. Bansal M, Di Bernardo D
Inferring gene regulatory networks from time expression profiles
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Recent developments in large-scale genomic technologies, such as DNA microarrays and mass spectroscopy, have made the analysis of gene networks more feasible. However, it is not obvious how the data acquired through such methods can be assembled into unambiguous and predictive models of these networks. In a recent study our group developed an algorithm (Network Identification by multiple regression, NIR) that used a series of steady-state RNA expression measurements, following transcriptional perturbations, to construct a model of a 9-gene network that is part of the larger SOS network in E. coli. Though the NIR method proved highly effective in inferring small microbial gene networks, its practical utility is limited because it requires: (i) prior knowledge of which genes are involved in the network of interest; (ii) the perturbation of all the genes in the network via the construction of appropriate episomal plasmids; (iii) the measurement of gene expression at steady state (i.e., constant physiological conditions after the perturbation). This experimental setup is impractical for large networks, is not easily applied to higher organisms and, most importantly, is not applicable if there is no prior knowledge of the genes belonging to the network. Here we propose a new algorithm that can infer the network of gene-gene interactions to which a gene of interest belongs and identify its direct targets, using the perturbation of only one of the genes in the network. To this end, we need to measure gene expression profiles at multiple time points following perturbation of only the known gene, or genes, without the need for the steady-state assumption.

57. Malerba G, Trabetti E, Sandri M, Xumerle L, Cavallari U, Galavotti R, Biscuola M, Patuzzo C, Pignatti PF
Single and multilocus analyses for the identification of at risk genotypes in cardiovascular disease
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Parental history of coronary heart disease (CHD) has long been recognized as a risk factor for CHD. Death from coronary heart disease is influenced by genetic factors in both women and men. Several epidemiological studies have described a number of underlying risk factors for cardiovascular disease (diabetes mellitus, hypercholesterolemia, plasma lipids, hypertension) which are also under a moderate degree of genetic control. In searching for genetic susceptibility factors associated with coronary artery disease (CAD), we determined the genotypes for 35 candidate genes (63 polymorphisms) in a sample of 757 individuals with angiographically documented coronary artery disease (CAD+, cases) and 320 individuals with angiographically documented normal coronary arteries (CAD-, controls). It is very hard to discover true combinations of multiple factors contributing to the disease. Recent publications show a growing number of genes being studied and correlated with phenotypic variations. The difficulties in handling the increasing amount of available data indicate the need for new tools able to retrieve the relevant information. We propose the implementation of the classification tree procedure, combined with backward elimination, as an exploratory tool to screen for genetic factors that may be associated with the CAD phenotype.

58. Di Bernardo D, Gardner TS, Collins JJ
Drug Target Identification from Inferred Gene Networks: a computational and experimental approach
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Genome-wide gene expression profiles provide a means to discover the direct mediators of biologically active compounds. We have already shown that it is possible to infer a predictive model of a genetic network by overexpressing each gene of the network and measuring the resulting steady-state expression of all the genes in the network. This approach, however, requires the perturbation of each gene and the measurement of the perturbation magnitude. In this work we explored the possibility of inferring predictive models of large genetic networks without requiring knowledge of which genes have been perturbed and by what amount. The network identification algorithm described here makes it possible to infer a model of a genetic network from perturbation experiments for which the perturbed genes are not known. This model can be used to identify the target gene, or genes, of a given drug.

59. Amici R, Bartocci E, Merelli E
A virtual laboratory for simulating metabolic pathways
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Generally, a biological system consists of interconnected processes that cooperate to produce the global behaviour of the system, through functional rules and relationships between the subunits. This organization of processes leads to a dynamic system model based on the temporal evolution of its parameters. The difficulty of establishing a priori the response to a new stimulus from the environment increases the complexity of this kind of system. Among the great number of biological systems found in nature, we consider metabolic pathways, which are collections of enzymatic processes involved in the transformation of several substances. The available pathways can be viewed on the KEGG web site; we chose to study the citric acid cycle shown in Figure 1, and we propose a virtual laboratory for simulating the behaviour of the selected pathway.

60. Puntervoll P, Linding R, Gemund C, Chabanis-Davidson S, Mattingsdal M, Cameron S, Martin DMA, Ausiello G, Brannetti B, Costantini A, Zanzoni A, Maselli V, Via A, Cesareni G, Diella F, Superti-Furga G, Wyrwicz L, Ramu C, McGuigan C, Gudavalli R, Letunic I, Bork P, Rychlewski L, Kuster B, Helmer-Citterich M, Hunter WN, Aasland R, Gibson TJ
Eukaryotic Linear Motifs in the ELM Web Tool
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Reflecting the modular nature of eukaryotic proteins, several WWW servers (e.g. PFAM, SMART, PROSITE) are dedicated to revealing domains in protein sequences. However, there is no resource that specifically focuses on short functional motifs (targeting peptides, docking modules, glycosylation sites, phosphorylation sites, etc.), yet these modules are just as important for function as the larger protein domains. Domains are identified by conventional methods, such as patterns (regular expressions), profiles or HMMs. But statistically robust methods cannot usually be applied to small motifs, while pattern-based methods over-predict enormously, so that the few true motifs are lost amongst the many false positives. ELM (Eukaryotic Linear Motifs - http://elm.eu.org) [1] is a new web-based tool for the prediction of these small motifs in eukaryotic protein sequences. At the moment, the ELM database contains manually curated information about 114 known linear motifs in the form of regular expressions, profiles or hidden Markov models that identify the motifs on the sequence. ELM addresses the over-prediction deficiency of other methods by the use of context-based rules and logical filters that exclude false positives. The current version of the ELM server provides core functionality including filtering by cell compartment, phylogeny, globular domain clash (using the SMART/Pfam databases), secondary structure, and solvent accessibility. The current set of motifs is not at all exhaustive. Filters work by comparing the information on the motifs stored in the database (taxonomic, structural and cellular context) with the information submitted by the user together with the sequence.
The structural filter works by automatically modeling the submitted protein sequences, whenever a good template is found in the SCOP database, and comparing predicted solvent accessibility values and secondary structure features with the corresponding values associated with ELM matches on true positive structures. The ELM server was launched in November 2002 and has been regularly enhanced since then. For several months, server activity has been running at > 45,000 hits from > 1700 unique internet sites.
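
The combination of pattern matching and context filters can be sketched in a few lines. This is only an illustration of the idea, not the ELM implementation or its data: the motif entry, its compartment annotation, the accessibility values and the 0.25 cutoff are all hypothetical.

```python
import re

# Hypothetical motif entry; the pattern and annotation are NOT taken from ELM.
MOTIF = {"name": "EXAMPLE_PxxP", "regex": r"P..P", "compartments": {"cytosol"}}

def find_matches(sequence, motif):
    """Return (start, matched_text) for every match, including overlapping ones."""
    pattern = re.compile("(?=(" + motif["regex"] + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

def filter_matches(matches, motif, compartment, accessibility, min_access=0.25):
    """Keep matches that are consistent with the user-supplied context.

    compartment   : cell compartment submitted with the sequence
    accessibility : per-residue relative solvent accessibility (0..1), assumed given
    min_access    : assumed cutoff on the mean accessibility of the match
    """
    if compartment not in motif["compartments"]:
        return []
    kept = []
    for start, text in matches:
        window = accessibility[start:start + len(text)]
        if sum(window) / len(window) >= min_access:
            kept.append((start, text))
    return kept

sequence = "APPLPAAPKRPAA"
accessibility = [0.9] * 4 + [0.05] * 4 + [0.6] * 5
raw = find_matches(sequence, MOTIF)
filtered = filter_matches(raw, MOTIF, "cytosol", accessibility)
```

Here the raw pattern search returns three hits, and the buried one is discarded by the accessibility filter; the compartment filter rejects all hits when the submitted context does not match the motif annotation.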

61. Ceroni A, Frasconi P
Using Constraints on Beta Partners to Reconstruct Mainly Beta Proteins
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Knowledge of the spatial conformation of a protein can help the study of its function, but the number of resolved structures is still limited by the low throughput of the methods used. Structure prediction could bridge the sequence-structure gap, but no reliable and general methods have yet been proposed. An attempt to simplify the problem has been made by trying to predict the contact map of a protein instead of its atomic positions. It has been demonstrated that the protein structure can be reconstructed with sufficient precision even if the contact map contains errors. Unfortunately, the prediction of contact maps is still very unreliable, and it is not clear whether the type of errors made by the predictor can be corrected by the reconstruction method. A low-detail representation of the protein conformation could extract the relevant information to train more efficient predictors. The coarse-grain contact map is defined using contacts between secondary structure segments. The prediction of this type of contact has been tried, but no results exist about the feasibility of a reliable method that uses only this type of information to reconstruct the protein structure. In this work we concentrate on contacts defined by beta partners. The geometry and connectivity of beta strands impose strong constraints on the overall structure of the protein, especially for those chains that are formed mainly by residues in beta conformation. The reconstruction of the structure of this kind of protein would be enhanced by knowledge of the secondary structure and an indication of which strands are partners. We propose here an efficient procedure to find a structure that matches the aforementioned characteristics of a given protein in its native conformation.
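
To make the two levels of representation concrete, the sketch below derives a residue-level contact set from coordinates and collapses it into a coarse-grain map between secondary structure segments. The 8 Å cutoff, the minimum sequence separation, the rule of at least two residue pairs per segment contact, and the toy hairpin coordinates are assumptions chosen for illustration; the abstract does not specify them.

```python
import math

def residue_contacts(coords, cutoff=8.0, min_separation=3):
    """Residue pairs (i, j), i < j, closer than `cutoff` (assumed value, in angstroms)."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                contacts.add((i, j))
    return contacts

def coarse_contact_map(contacts, segments, min_pairs=2):
    """Collapse residue contacts into contacts between segments.

    segments : list of (start, end) residue ranges, inclusive
    Two segments are in contact if at least `min_pairs` residue pairs
    (an assumed rule) connect them.
    """
    def segment_of(residue):
        for index, (start, end) in enumerate(segments):
            if start <= residue <= end:
                return index
        return None   # residue lies in a loop between segments

    counts = {}
    for i, j in contacts:
        a, b = segment_of(i), segment_of(j)
        if a is None or b is None or a == b:
            continue
        key = (min(a, b), max(a, b))
        counts[key] = counts.get(key, 0) + 1
    return {pair for pair, count in counts.items() if count >= min_pairs}

# Toy hairpin: two antiparallel strands 5 angstroms apart joined by one loop residue.
coords = [(0, 0, 0), (3.5, 0, 0), (7, 0, 0), (10.5, 0, 0), (14, 2.5, 0),
          (10.5, 5, 0), (7, 5, 0), (3.5, 5, 0), (0, 5, 0)]
strands = [(0, 3), (5, 8)]
partners = coarse_contact_map(residue_contacts(coords), strands)
```

For the toy hairpin the coarse map contains a single entry pairing the two strands, which is the kind of beta-partner information the abstract proposes to use as a reconstruction constraint.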

62. Di Dato V, Di Lauro R, Chiusano ML
Comparative genomics to identify regulatory regions: an example from the PAX8 gene
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Comparisons between human and rodent non-coding sequences are widely used to identify highly conserved sequences that could have functional implications. In particular, intergenomic comparisons are rapidly evolving for investigations of regulatory regions involved in promoter activity. Moreover, the efficacy of such comparisons for the identification of functional regulatory elements can also help in the study of the evolutionary dynamics of promoter sequences. We are conducting computational analyses, based on comparative genomics between Homo sapiens and Mus musculus, on regions of at least 200 kb spanning the entire genomic locus of genes involved in thyroid differentiation, to understand their expression mechanisms and regulation. A preliminary study on the PAX8 gene was supported by experimental analysis. The analysis resulted in the identification of 91 conserved regions, of which 35 located at the 5' end of the gene were chosen to start the experimental analysis. They were tested for functional implications in PAX8 promoter activity, leading to the identification of thyroid-specific regulatory regions. The results of the current analysis provide experimental evidence that in turn has three fundamental purposes: to help clarify the mechanisms of regulation and expression of the genes investigated; to improve the proposed computational methodology and strengthen its predictive power; and to validate the computational approaches for the analysis of transcription factor binding sites, giving more hints for understanding their organization and the pattern of evolution in regulatory sequences.

63. Capriotti E, Fariselli P, Rossi I, Casadio R
Improving the Detection of Protein Remote Homologues Using Shannon Entropy Information
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: We analyze the quality of the alignments generated by the profile-profile alignment algorithm known as BASIC and compare the results with those obtained with a structural alignment code. We find that a Shannon entropy value > 0.5 gives a sequence-to-sequence alignment of the target/template couple comparable to that obtained with the structural alignment performed with CE. In our fold recognition/threading code Tangram, the BASIC profile-profile alignment is implemented as follows: 1. The composition profiles P_A and P_B for the target and template are generated by multiple alignment of the sequences obtained from a three-iteration PSI-BLAST search on the Non-Redundant database (the inclusion threshold is E = 10^-3). 2. The dot matrix D for the profile comparison of two protein sequences, D = P_A^T S P_B (with S the BLOSUM62 substitution matrix), is computed using linear algebra routines. 3. The D matrix is searched for high-scoring alignments by means of the local Smith-Waterman dynamic programming algorithm. The test set used for the evaluation is composed of 185 template/target couples of PDB structures that share the same SCOP label but have less than 30% sequence identity. When the top-scoring alignment for each target protein in the test set is considered, our BASIC implementation detects the full SCOP label for 125 couples (68%) and generates 114 (62%) alignments with a MaxSub score >= 1. Interestingly, it is found that nearly all of the high-quality alignments share a common feature: the average Shannon entropy for the profile sections aligned together is greater than 0.5 for both the template and the target.
If only the top-scoring alignments for which this condition holds are considered, a subset of 119 alignments is selected, and for 116 of them (97%) the full SCOP label can be assigned to the target, while 108 (91%) get a nonzero MaxSub score, with an average MaxSub score of 4.6 on the subset. On the same 119 couples, the structural alignment program CE computes a nonzero MaxSub score for 116 of them, with an average of 5.7 points. These results indicate that the Shannon entropy value can be used to select a subset of sequence profile-profile alignments of quality comparable to that obtained by means of a structural alignment program.
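
The two quantities used above, the dot matrix D = P_A^T S P_B and the average Shannon entropy of the aligned profile sections, can be sketched in plain Python. This is not the Tangram code: profiles are represented here as lists of per-position frequency vectors, the entropy is computed in bits (the abstract does not state the logarithm base), and a two-letter alphabet replaces the 20 amino acids and BLOSUM62 to keep the example small.

```python
import math

def shannon_entropy(column):
    """Entropy of one profile column, in bits (the base is an assumption)."""
    return -sum(p * math.log2(p) for p in column if p > 0)

def mean_entropy(profile_section):
    """Average column entropy over a section of a profile."""
    return sum(shannon_entropy(col) for col in profile_section) / len(profile_section)

def dot_matrix(profile_a, profile_b, subst):
    """D[i][j] = sum over x, y of P_A[x][i] * S[x][y] * P_B[y][j].

    Each profile is a list of positions; each position is a list of
    residue frequencies in the same order as the rows of `subst`.
    """
    return [[sum(pa[x] * subst[x][y] * pb[y]
                 for x in range(len(pa)) for y in range(len(pb)))
             for pb in profile_b]
            for pa in profile_a]

def passes_entropy_filter(section_a, section_b, threshold=0.5):
    """True if both aligned sections have mean entropy above the threshold."""
    return mean_entropy(section_a) > threshold and mean_entropy(section_b) > threshold

# Toy two-letter alphabet and substitution matrix, for illustration only.
S = [[1, -1], [-1, 1]]
P_A = [[1.0, 0.0], [0.5, 0.5]]
P_B = [[1.0, 0.0]]
D = dot_matrix(P_A, P_B, S)
```

In a real setting D would then be passed to a local Smith-Waterman search, and the entropy filter would be applied to the profile columns covered by the resulting alignment.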

64. D'Alessandro L, Felice B, Montemurro F, Medico E
Meta-analysis of multiple microarray datasets reveals a novel genomic signature associated to invasive growth of epithelial cells and early breast cancer metastasis.
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: HGF, also known as “Scatter Factor”, is a mesenchymal cytokine that acts on epithelial and endothelial cells by promoting a highly integrated biological program, hereafter referred to as “invasive growth”. This program involves coordinated control of basic cellular functions including dissociation and migration (“scattering”), invasion of extracellular matrix, proliferation, prevention of apoptosis and polarization. As a consequence, complex developmental processes take place, such as branched morphogenesis of epithelia and angiogenesis. Oncogenic activation by overexpression or point mutation of the gene encoding the tyrosine kinase receptor for HGF, c-MET, is involved in the progression of tumors towards the invasive-metastatic phenotype. To identify genes involved in Met-driven invasive growth, we explored the transcriptional response of mouse liver cells to HGF at different time points. Two different microarray platforms were adopted, consisting respectively of high-density spotted cDNAs (Incyte) and in-situ synthesized oligonucleotides (Affymetrix). Global exploration of 25,000 gene transcripts yielded over 1500 transcriptionally regulated sequences, corresponding to genes involved in the control of the basic biological functions underlying the invasive growth program: transcription, signal transduction, apoptosis, proliferation, cytoskeleton organization, motility and adhesion. Joint analysis of the data obtained with the two platforms allowed the identification of genes with more consistent and reproducible regulation. Meta-analysis of genomic expression datasets obtained from breast carcinoma showed that expression of genes belonging to the HGF signature is correlated with cancer progression.

65. Menozzi G, Riva L, Sironi M, Pozzoli U
Intron and exon lengths influence on splicing
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Splice site consensus values (CVs) are usually calculated using previously described matrices [1], which were obtained through the analysis of a relatively small number of splice sites (1500) from different organisms. Now that genome annotation is nearing completion, a much more accurate definition is possible. Furthermore, recent studies [2, 3, 4] indicate that the consensus value alone is not sufficient to define splice site strength, and other parameters must be considered to improve splice site definition. To investigate how intron and exon lengths might be exploited by the splicing machinery to ensure proper splicing control and regulation, a human intron database has been developed and analyzed.

66. Riva L, Menozzi G, Sironi M, Cerutti S, Pozzoli U
A Wavelet Based Method to Predict Nucleosome Positions
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: The nucleosome core particle is the fundamental repeating subunit of chromatin. It consists of two molecules each of the four 'core histone' proteins, H2A, H2B, H3 and H4, and a 147 bp stretch of DNA. A better knowledge of the nucleosomal organization of chromatin is crucial to understanding many important phenomena occurring in chromosomes. Regulatory mechanisms of gene expression are partially influenced by nucleosome positioning, and regions with exposed chromatin (i.e. where nucleosomes are more distant) can be more prone than others to double-strand breaks. Analysis of nucleosomal DNA has demonstrated the existence of a weak sequence-dependent signal for nucleosome positioning; this makes classical computational biology methods, such as alignment and consensus sequences, poorly applicable here. The ability of DNA to assume a certain conformation at certain positions can considerably enhance its binding potential to nucleosomes. According to recent X-ray structure studies, the 147 bp nucleosomal DNA has detectable bends symmetrically displaced around the central position; this suggests the presence of localized periodicities in DNA bendability. The wavelet transform can be used to evaluate periodicities locally, making it possible to detect positions with a bend distribution similar to that of known nucleosomal DNA.

67. Galfrè S, Morandin F, Cozza A, Pellegrini S, Marangoni R
A method to improve microarray-based identification of SNPs
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: A Single Nucleotide Polymorphism (SNP) is a variation in sequence (polymorphism) between individuals caused by a change in a single nucleotide. SNPs are responsible for most of the genetic variation between individuals. Furthermore, the identification of distinct SNPs may play a crucial role in assessing a potential genetic influence on those disorders that do not appear to have a simple genetic transmission. In turn, the identification of genetic risk factors may help to determine biological markers of disease that can be used for the preclinical diagnosis of a pathological condition. Early diagnosis is important for enacting successful therapeutic strategies. In order to obtain more informative data, multiple SNPs should be tested simultaneously in the same individuals. A common protocol used in SNP investigations is based on Single Base Extension (SBE) followed by microarray hybridization, in which each DNA sample is hybridized on two arrays: one used to explore the existence of “A” and “T” in the SNP locus, the other array for “C” and “T”. To obtain a global evaluation of the frequency with which each SNP is represented in the population, it is necessary to make a quantitative comparison of the signals recorded from the two arrays. For many technical reasons, a large quantity of noise is introduced during this step, thus compromising the reliability of the final data. Here we present a simple approach, based on the use of three arrays instead of only two, which can address this problem. We also give a statistical method for data processing to be used with the proposed experimental protocol.

68. Pozzoli U, Menozzi G, Riva L, Sironi M
COBITIS: COmputational BIology Tools Interoperability Schema
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Using algorithms programmatically in computational biology almost always presents several difficulties. Authors often choose to publish their algorithms through web interfaces, which make them easy for a human user to employ but make it impractical for any software system to use them within a more complex processing workflow. The problem is even more limiting when the algorithm must be used repeatedly. Even the availability of compiled versions, or indeed of the source code, does not completely solve the problem. In fact, regardless of the installation/integration difficulties, there remains the problem of the format in which data are required and results are supplied. A partial and intuitively practicable solution is the standardization of data formats. Many attempts have been made in this direction, but none has achieved the goal of defining a generally accepted and used format, except in specific domains or within single organizations. The use of formats defined by XML schemas allows algorithms to identify the type of the data supplied. Using an XML schema can be very efficient if, for example, the algorithms can communicate via SOAP. We have developed a set of tools, in C++ and in a platform-independent way, that allow the implementation of algorithms able to exchange data according to COBITIS, a simple XML schema. These tools allow the transformation of data from various formats to COBITIS and the implementation of client and server applications that communicate via SOAP, allowing remote and distributed use of algorithms.
In particular, we have developed a server accessible through web services and two clients: a web client that uses XSLT for data visualization, solving many problems in the implementation of interfaces, and one that allows access to the server from Matlab. We believe that, while giving up on imposing any ontology on the data, this model can solve several of the problems related to the programmatic use of algorithms in computational biology.

69. Ceol A, Montecchi-Palazzi L, Persico M, Gavrila C, Castagnoli L, Cesareni G
The (new) MINT Database.
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Scientists recognize that a complete description of cell physiology requires an understanding of the “global” protein interaction network. Thus, a database that collects this information, which is presently dispersed in the scientific literature (or accumulated by high-throughput experiments), is an essential post-genomic tool. MINT was conceived a couple of years ago as a collaborative effort between the group of Molecular Genetics and the students of the PhD program in Molecular and Cellular Biology of the University of Rome Tor Vergata. MINT is a relational database designed to store data on functional interactions between proteins, and aims at being exhaustive in the description of each interaction, including information, whenever available, about kinetic and binding constants and about the domains participating in the interaction. Presently MINT focuses on experimentally verified interactions extracted from the scientific literature by curators, with special emphasis on mammalian organisms. The MINT protein interaction database offers the scientific community a unique bioinformatic tool for designing and interpreting experiments.

70. Cannata N, Forcato C, Fabbro G, Pasin A, Balen J, Valle G
Searching for discriminating degenerated patterns between two populations of sequences
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: In this work we present the development of a bioinformatics tool aimed at identifying sequence patterns that discriminate between two populations of sequences. Examples of its possible use are easy to find in genomics and proteomics: introns/exons in gene sequences, coding/non-coding regions in transcript sequences, proteins that are transported to a given subcellular localization and those that are not. Once the patterns are detected, they could be searched for in non-annotated sequences by a program specially developed to find degenerate patterns. We expect that such a method, used jointly with other more traditional methods, could lead to better predictive power in annotation processes.
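
One simple way to picture the search for discriminating degenerate patterns is to enumerate short patterns with a wildcard and compare how often each occurs in the two populations. The sketch below does exactly that; it is an assumed, brute-force illustration rather than the tool described in the abstract, and the scoring (difference in the fraction of sequences containing the pattern), the wildcard symbol `N` and the toy sequences are all hypothetical.

```python
import re
from itertools import product

def degenerate_patterns(k, alphabet="ACGT", wildcard="N", max_wildcards=1):
    """Yield all patterns of length k with at most `max_wildcards` internal wildcards."""
    for symbols in product(alphabet + wildcard, repeat=k):
        if (symbols.count(wildcard) <= max_wildcards
                and symbols[0] != wildcard and symbols[-1] != wildcard):
            yield "".join(symbols)

def coverage(pattern, sequences, wildcard="N"):
    """Fraction of sequences containing at least one match of the pattern."""
    regex = re.compile(pattern.replace(wildcard, "."))
    return sum(1 for s in sequences if regex.search(s)) / len(sequences)

def discriminating_patterns(pop_a, pop_b, k=3, top=5):
    """Rank patterns by the absolute difference in coverage between populations."""
    scored = []
    for pattern in degenerate_patterns(k):
        diff = coverage(pattern, pop_a) - coverage(pattern, pop_b)
        scored.append((diff, pattern))
    scored.sort(key=lambda item: (-abs(item[0]), item[1]))
    return scored[:top]

# Toy populations, for illustration only.
pop_a = ["AGTAC", "CGTAG"]
pop_b = ["ACCCC", "CCCAG"]
best = discriminating_patterns(pop_a, pop_b, k=3, top=4)
```

A positive score marks a pattern enriched in the first population and a negative score one enriched in the second; exhaustive enumeration like this only scales to short patterns, so a practical tool would need a smarter search.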

71. Vitulo N, Cestaro A, Vezzi A, Campanaro S, Simonato F, Lauro F, Malacrida G, Simionati B, Cannata N, Bartlett D, Valle G
Development of tools based on UCSC and KEGG for the annotation of the Photobacterium profundum genome
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: One of the critical steps in a genome sequencing project is the efficient storage and retrieval of the large amount of information produced, which represents the starting point for data analysis and interpretation. We have recently completed the genome sequence of Photobacterium profundum strain SS9, and the data have been loaded into a genome browser in the UCSC environment. The UCSC genome browser was developed at the University of California, Santa Cruz, and CRIBI hosts one of its official mirror sites at http://genome.cribi.unipd.it. The sequence and annotation information is stored in a MySQL relational database, and a web-based tool performs fast visualization and querying of the data. The records are displayed as a series of tracks aligned with the genomic sequence. The Photobacterium profundum genome browser contains the ORF predictions obtained by two different programs (Orpheus and Glimmer) and the related non-redundant ORF consensus, the ribosome, tRNA and operons, the clones spotted on the microarray chips, the differentially expressed clones derived from microarray experiments, the orthologous genes in other bacteria, the phage and a prediction of the repeated elements in the genome.

72. Attimonelli M, Accetturo M, Scioscia G, Marinelli C, Leo P, Santamaria M, Mona S, Lascaro D, Cascione I, Tommaseo-Ponzetta M
HMDB, the Human Mitochondrial Data Base, a genomic resource supporting population genetics studies and biomedical research on mitochondrial diseases
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Population genetics studies based on the analysis of mtDNA and studies of mitochondrial diseases have produced a huge quantity of sequence data and related information. These data, classified as RFLPs, mtDNA SNPs, pathogenic mutations, HVS1 and HVS2 sequences, and complete mtDNA sequences, are at present distributed worldwide across differently organised databases and web sites that are not well integrated with one another. Several specialised mitochondrial databases and databases of variability data have been designed and implemented, but they are generally structured as simple repositories where data are stored, without the possibility of performing any analysis. Moreover, users generally cannot submit their own data and at the same time analyse them by comparison with the content of a given database; this holds both for population genetics data and for mitochondrial disease data. As far as population genetics data are concerned, for example, the problem of classifying sequences into haplogroups is becoming more and more important as improvements in sequencing technologies increase the availability of new complete mitochondrial genomes. Indeed, up to now the only way to establish the haplogroup to which a given mitochondrial sequence belongs has been to inspect its variant sites manually with respect to a reference sequence, referring to the literature in order to identify its haplogroup-specific polymorphisms. As far as mitochondrial disease data are concerned, despite the large number of disease-associated mutations already discovered in the last few years, the sequencing of the complete human mt genome is allowing the discovery of new pathogenic mutations. Indeed, up to now, the pathogenicity of mtDNA mutations has, in most cases, been validated mainly by their segregation with the disease and by the consequent loss of function when the mutation involves a structural gene. However, no systematic statistical analysis of the mtDNA SNPs has been performed until now.
Here we present the design of a Human Mitochondrial genome DataBase (HMDB) that will collect the publicly available complete human mitochondrial genomes, interfaced to analysis programs, allowing the classification of newly sequenced human mitochondrial genomes and the prediction, through site-specific nucleotide and amino acid analysis, of the pathogenic potential of mitochondrial polymorphisms.
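
The manual procedure described above (list the variant sites of a sequence against a reference, then look up haplogroup-specific polymorphisms) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not HMDB's classification method: the sequences are assumed to be already aligned to equal length, and the haplogroup names and marker positions are invented toy values rather than real mtDNA definitions.

```python
def variant_sites(reference, sample):
    """Return {1-based position: sample base} where two aligned sequences differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return {
        i + 1: b
        for i, (a, b) in enumerate(zip(reference, sample))
        if a != b
    }


def classify(variants, definitions):
    """Pick the haplogroup whose defining variants are all present.

    When several haplogroups match, the one with the most defining
    markers (the most specific one) is preferred. Returns None if
    nothing matches.
    """
    matches = [
        (len(markers), name)
        for name, markers in definitions.items()
        if all(variants.get(pos) == base for pos, base in markers.items())
    ]
    return max(matches)[1] if matches else None


if __name__ == "__main__":
    # Toy data: hypothetical haplogroups defined on a 10-nt "genome".
    reference = "ACGTACGTAC"
    sample = "ACATACGTGC"
    toy_definitions = {
        "toy_H1": {3: "A"},
        "toy_H1a": {3: "A", 9: "G"},
        "toy_U": {5: "T"},
    }
    found = variant_sites(reference, sample)
    print(found, classify(found, toy_definitions))
```

Real haplogroup assignment must also handle insertions, deletions, back-mutations and missing data, none of which this sketch covers.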

73. Attimonelli M, Accetturo M, Lascaro D
Statistical prediction of pathogenic variant sites in human mitochondrial genomes
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Mitochondrial DNA disorders – disorders associated with dysfunctions of the oxidative phosphorylation system (OXPHOS) – are caused by inborn errors of metabolism and have an estimated frequency of 1 in 10000 live births. Because of the central role played by the OXPHOS system in ATP production, the causes and effects of mitochondrial disorders are highly heterogeneous and complex. Mitochondrial disorders originate mainly from mutations in both nuclear and mitochondrial DNA. Although prenatal diagnosis is routine for nuclear DNA mutations, cases of prenatal diagnosis of mtDNA mutations are rare, even though they are urgently needed, as no real therapies exist. However, thanks to bioinformatics support, the gap may be reduced in a short time. Indeed, up to now, the pathogenicity of mtDNA mutations has, in most cases, been validated mainly by their segregation with the disease and by the consequent loss of function when the mutation involves a structural gene, but no systematic statistical analysis of the mtDNA SNPs has been performed. Moreover, the criteria commonly followed to associate a mutation with a given pathology are: - amino acid change at a strictly conserved site; - presence in patients only; - heteroplasmy; - presence in phenotypically similar but ethnically different families. However, a strict mutation-phenotype correlation in patients is not always verified. Here we propose a statistical approach intended to contribute to the estimation of pathogenic variant sites. The analysis is based on the estimation of site-specific relative variability in a set of homologous sequences, through the application of the SiteVarProt and SiteVariability software, in order to infer a correlation between site variability and the pathogenicity of a given mutation.
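
To make "site-specific relative variability" concrete, here is a small sketch that scores each column of a multiple alignment. The abstract does not say how SiteVarProt and SiteVariability compute variability, so Shannon entropy is used here as a stand-in, and scaling to the most variable site is likewise an assumption of this example.

```python
from collections import Counter
import math


def site_entropy(alignment):
    """Shannon entropy (in bits) of each column of equal-length aligned sequences."""
    if len({len(s) for s in alignment}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    entropies = []
    for column in zip(*alignment):
        total = len(column)
        freqs = [c / total for c in Counter(column).values()]
        entropies.append(sum(-p * math.log2(p) for p in freqs))
    return entropies


def relative_variability(alignment):
    """Entropy of each site divided by the entropy of the most variable site."""
    ent = site_entropy(alignment)
    top = max(ent)
    if top == 0:
        return [0.0 for _ in ent]
    return [h / top for h in ent]


if __name__ == "__main__":
    # Toy protein alignment, not real data.
    toy = ["MKLV", "MKIV", "MRLV", "MKAV"]
    print([round(v, 3) for v in relative_variability(toy)])
```

Under the reasoning in the abstract, a mutation at a site with low variability across homologous sequences would be a stronger candidate for pathogenicity than one at a highly variable site.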

74. Di Vincenzo L, Grgurina I, Pascarella S
Computational analysis of structural properties of classical and novel non ribosomal aminoacyladenylate forming domains.
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Nonribosomal peptide synthetases (NRPSs) are multidomain, multifunctional enzymes involved in the biosynthesis of many bioactive microbial peptides such as phytotoxins, siderophores, biosurfactants, and anticancer agents. The minimal module required for a single monomer addition consists of a condensation domain (C), an adenylation domain (A) and a peptidyl carrier protein (PCP) domain also denoted as thiolation (T) domain. Systematic comparative analyses identified 8 or 10 sequence positions lining the active site pocket which are held responsible for substrate recognition and selection in A domain. Recently, it has been pointed out that several enzymes possibly involved in lysine metabolism in eucaryotes display a 3-domain architecture where the two N-terminal domains are homologous to the A and T domains from NRPS systems. The third C-terminal section may contain a PQQ, a NADPH or a functionally uncharacterized domain. Our work is aimed at the structural characterization and the study of common molecular features of the family of the aminoacyladenylate-forming enzymes from NRPS and from the recently discovered homologous enzymes. Psi-BLAST searches were applied over the GeAll and Non-Redundant databanks using query sequences Ebony (gi:3286766) from Drosophila melanogaster, 5-aminoadipic acid synthase (gi:30348962) from Mus musculus and aminoadipate-semialdehyde dehydrogenase from yeast (swissprot:LYS2_YEAST). Thirty-two sequences were identified from different eucaryotic species and the domain assignments were confirmed by CDD and Pfam queries. The sequence subsets containing the A-T domains were aligned utilizing the HMMER package. On the basis of the structural homology encoded in this multiple alignment, the potential occurrence of a “specificity code” similar to that described for the NRPS systems has been tested. 
The residues which interact with the α-amino and α-carboxy groups of the amino acid substrates [2], Asp235 and Lys517 respectively, are conserved, the only exceptions being the Ebony protein (gi:3286766) from Drosophila melanogaster and the protein (gi:21291643) from Anopheles gambiae, where Asp235 is replaced by valine. Homology molecular modelling has been used to map the conserved residues onto a hypothetical active site structure of the 5-aminoadipic acid synthase from Homo sapiens (gi:32261239) and Ebony (gi:3286766) from Drosophila melanogaster, in order to understand the role of the conserved residues and to predict their interaction with the putative substrates. In the case of the Ebony proteins, Asp235 is replaced by Val, while Pro236, conserved in all 5-aminoadipic acid synthases and aminoadipate-semialdehyde dehydrogenases, is substituted by Asp, which can form a hydrogen bond with the β-amino group of the β-alanine substrate. The β-amino group also interacts via hydrogen bonds with Ser301 and Asp331. The other residues line and shape the active site pocket. Characterization of the α-aminoadipate synthase is under way.

75. Ceroni A, Frasconi P
On the Role of Long-Range Dependencies in Learning Protein Secondary Structure
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Prediction of protein secondary structure (SS) is a classic problem in computational molecular biology and one of the first successful applications of machine learning to bioinformatics. Most available prediction methods use feedforward neural networks whose input is the multiple alignment profile in a sliding window of residues centered around the target position. By construction, predictions obtained with these methods are local. Long-range dependencies, on the other hand, clearly play an important role in this problem. The use of bidirectional recurrent neural networks (BRNN) was previously proposed for the prediction of SS. The architecture in this case allows us to process the sequence as a whole and to “translate” the input profile at each position into a corresponding output prediction for that position. Theoretically, the output at any position in a BRNN depends on the entire input sequence and thus a BRNN might actually exploit long-range information. Unfortunately, well-known problems of vanishing gradients do not allow us to learn these dependencies. In this paper, we are interested in developing an architecture that can effectively exploit long-range dependencies, assuming some additional information is available to the learner. We start from a rather simple intuitive argument: if the learner had access to information about which pairs of positions are expected to interact, its task would be greatly simplified and it could possibly succeed. In the case of SS prediction, a reasonable source of information about long-range interactions can be obtained from contact maps (CM), a graphical representation of the spatial neighborhood relation among amino acids. Of course, in order to obtain a CM the protein structure must be known. In addition, it is well known that backbone atoms’ coordinates can be reconstructed starting from CMs.
Thus, in a sense, using CM information in order to predict SS might appear foolish since most of the information about the 3D structure of the protein is already contained in the map. However, the following considerations suggest that this setting is worth investigation: • Algorithms that reconstruct structure from CMs are based on a potential energy function with many local minima whose optimization is not straightforward. Thus it is not clear that a supervised learning algorithm can actually learn to recover SS from CMs. • CMs can be predicted from sequence or can be obtained from structures predicted by ab-initio methods such as Rosetta. Although accuracy of present methods is certainly not sufficient to provide a satisfactory solution to the folding problem, predicted maps may still contain useful information to improve the prediction of lower order properties such as the SS. • Even if CMs are given, the design of a learning algorithm that can fully exploit their information content is not straightforward. For example, Meiler and Baker have shown that SS prediction can be improved by using information about inter-residue distances. Their architecture is a feedforward network fed by average property profiles associated with amino acids that are near in space to the target position. In this way, relative ordering among neighbors in the CM is discarded. The solution proposed in this paper is based on an extended architecture that receives as an additional input a graphical description of the pairwise interactions between sequence positions. We call this architecture interaction enriched BRNN (IEBRNN). Its details are presented in a longer version of this paper.
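
For readers unfamiliar with contact maps, the sketch below builds one from a list of 3D coordinates (one point per residue) and lists the long-range partners of a position, which is the kind of pairwise interaction information the abstract proposes to feed to the network. The 8 Å distance threshold, the minimum sequence separation and the function names are assumptions made for illustration. The abstract does not state the definition used by the authors, and the IEBRNN architecture itself is not reproduced here.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=1):
    """Binary, symmetric contact map from per-residue 3D coordinates.

    Residues i and j are in contact when their Euclidean distance is at
    most `threshold` and they are at least `min_separation` apart in
    sequence.
    """
    n = len(coords)
    cm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                cm[i][j] = cm[j][i] = 1
    return cm


def long_range_partners(cm, i, min_separation=6):
    """Positions in contact with i that are far from it along the sequence."""
    return [j for j, c in enumerate(cm[i]) if c and abs(i - j) >= min_separation]


if __name__ == "__main__":
    # Toy hairpin: two antiparallel strands 5 units apart (not a real structure).
    coords = [
        (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
        (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
    ]
    cm = contact_map(coords)
    print(long_range_partners(cm, 0))
```

In the toy hairpin, the first residue is close in space to residues at the far end of the chain, a dependency that a sliding-window predictor cannot see.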

76. Accardo MC, Giordano E, Riccardo S, Digilio FA, Iazzetti G, Calogero RA, Furia M
RNomics: a computational search for box C/D snoRNA genes in the D.melanogaster genome
Meeting: BITS 2004 - Year: 2004
Topic: Unspecified

Abstract: Genes producing functional RNAs rather than protein products form a large and varied class in all genomes, from bacteria to mammals. In higher organisms, non-coding RNA (ncRNA) appears to dominate the whole genomic output, and it is not surprising that the range of known RNA-induced phenomena is rapidly expanding. The central importance of RNA signaling to the eukaryotic cell has become apparent in the last few years, as a large body of evidence has pointed to novel roles for ncRNA molecules in both genetic and epigenetic processes. The family of ncRNA genes comprises many small nucleolar RNAs (snoRNAs) that guide the maturation or post-transcriptional modification of target RNA molecules. Most snoRNAs fall into two classes called box C/D and box H/ACA snoRNAs, with each class defined by the presence of common sequence motifs and common associated proteins. A few snoRNAs in either class are required for specific pre-rRNA cleavages and are essential for viability, whereas most are responsible for the 2’-O-ribose methylation (C/D) or pseudouridylation (H/ACA) of target RNA molecules, respectively. The C/D class guides site-specific 2’-O-ribose methylation by base-pairing of the 10-21 nt-long sequence positioned upstream from a D (or an internal D’) box to the target RNA, with the nucleotide positioned 5 base pairs (bp) upstream from the D/D’ box selected for methylation. Although most of the C/D and H/ACA box snoRNAs are involved in modifications of ribosomal RNA (rRNA), other types of RNA molecules, such as tRNAs, snRNAs, and possibly mRNAs, might be recognised as targets. Despite the importance of their functional roles, most snoRNAs have not yet been identified, even in organisms whose genome has been completely sequenced.
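
The targeting rule described above can be sketched as a toy predictor. This is not the search method used by the authors, which the abstract does not detail. It assumes the D box consensus `CUGA`, a fixed guide length of 12 nt (within the 10-21 nt range given above), exact Watson-Crick pairing with no G-U wobble or mismatches, and it reads "5 bp upstream" as the fifth snoRNA nucleotide upstream of the D box. All sequences are invented.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string over the alphabet A, C, G, U."""
    return rna.translate(COMPLEMENT)[::-1]


def predict_methylation_site(snorna, target, guide_len=12, d_box="CUGA"):
    """Return the 0-based index in `target` predicted to be methylated, or None.

    The guide is taken as the `guide_len` nucleotides immediately upstream
    of the last D box occurrence in the snoRNA. The target stretch pairing
    with it is its reverse complement. Because the duplex is antiparallel,
    the fifth guide nucleotide upstream of the D box pairs with the fifth
    nucleotide (offset 4) of that target stretch.
    """
    d_start = snorna.rfind(d_box)
    if d_start < guide_len:  # no D box, or not enough sequence upstream
        return None
    guide = snorna[d_start - guide_len:d_start]
    site = reverse_complement(guide)
    hit = target.find(site)
    if hit == -1:
        return None
    return hit + 4


if __name__ == "__main__":
    # Hypothetical snoRNA: C-box-like prefix, 12-nt guide, D box, short tail.
    guide = "ACGGAUUCGCAU"
    snorna = "GGAUGAUGAUU" + guide + "CUGA" + "CC"
    target = "UUUU" + reverse_complement(guide) + "GGGG"
    pos = predict_methylation_site(snorna, target)
    print(pos, target[pos])
```

A genome-wide search would work in the opposite direction, scanning for candidate C and D boxes and then testing the upstream sequence for complementarity to known rRNA modification sites.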


